
💡 📢 ☑️ Remember to read the README.md file for helpful hints on the best ways to view/navigate this project.
If you view this notebook on GitHub, you will be missing important content:
some charts/diagrams/features are not rendered there. This is standard, well-known behaviour.
Consider viewing the pre-rendered HTML files, or run all notebooks end to end after enabling the feature flags that control long-running operations.
If you choose to run this locally, there are some prerequisites:
- Python 3.9
- `pip install -r requirements.txt` before proceeding (provided by Turing College)
You and your friend came up with a brilliant startup idea - provide risk evaluation as a service for retail banks. As with most successful startup teams, both of you have your specialty. Your friend is responsible for sales and operations, while you are responsible for everything product-related, from planning to data analysis to building the solution. You have quickly identified that machine learning will be an essential part of your offering because you believe that the models can capture statistical patterns in the defaults on bank loans.
You decide to start your investigation by downloading this dataset from Home Credit Group. You are not yet sure what the most crucial problem for your potential clients is, so you had a meeting with your friend to discuss what your proof-of-concept (POC) product should look like. After a lot of arguing, you both agreed to create a number of different models so that you have a robust and diversified offering when you get your first meeting with potential clients. You are eager to investigate the dataset and see what you can predict, so you propose that you come up with interesting features to analyze and predict - this way, you'll focus on building a solid offering, and she can work on getting meetings with the banks.
import ipywidgets as widgets
from IPython.display import display, Markdown, Image, clear_output, HTML, IFrame
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
from mpl_toolkits.mplot3d import Axes3D
import itertools
import numpy as np
import pandas as pd
import polars as pl
import sqlite3
import seaborn as sns
from scipy import stats
from scipy.stats import chi2_contingency
import missingno as msno
import klib
import statsmodels.api as sm
from random import random, seed
import sqlite3 as lite
import logging
import warnings
import ydata_profiling
import iplantuml
import xml.dom.minidom
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import (
ConfusionMatrixDisplay,
accuracy_score,
recall_score,
precision_score,
)
from autofeatselect import CorrelationCalculator, FeatureSelector, AutoFeatureSelect
from sklearn.preprocessing import MinMaxScaler, StandardScaler, FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import plot_tree
from sklearn.metrics import make_scorer, confusion_matrix, PrecisionRecallDisplay
from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.base import clone as clone_pipeline
from sklearn.neighbors import NearestNeighbors
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC
import shap
from imblearn.under_sampling import RandomUnderSampler
import joblib
import os
from os import path
from utils import *
from utils import __
from autofeat import AutoFeatRegressor as AFR
from sklearn.model_selection import train_test_split as sklearn_train_test_split
from sklearn.exceptions import DataConversionWarning
import sweetviz as sv
import featuretools as ft
import woodwork as ww
import json
import subprocess
import category_encoders as ce
seed(100)
pd.options.display.max_rows = 100
pd.options.display.max_colwidth = 50
util.check("done")
loading utils modules... ✅ completed configuring autoreload... ✅ completed ✅
%reload_ext credit_clean
%reload_ext credit_utils
import credit_clean
import credit_utils
Let's use black to auto-format all our cells so they adhere to PEP8
import lab_black
%reload_ext lab_black
util.patch_nb_black()
# fmt: off
# fmt: on
from sklearn import set_config
set_config(transform_output="pandas")
sns.set_theme(context="notebook", style="whitegrid")
moonstone = "#62b6cb"
moonstone_rgb = handy_utils.hex_to_rgb(moonstone)
moonstone_rgb_n = np.array(moonstone_rgb) / 255
logger = util.configure_logging(jupyterlab_level=logging.WARN, file_level=logging.DEBUG)
warnings.filterwarnings("ignore", category=FutureWarning)
# import warnings
# warnings.filterwarnings('error', category=pd.errors.DtypeWarning)
def ding(title="Ding!", message="Task completed"):
"""
Sends a desktop notification. Only works on Linux, since it relies on notify-send.
"""
for i in range(2):
!notify-send '{title}' '{message}'
Let's also create a simple feature toggle that we can use to skip expensive operations while working on the notebook (to save ourselves some time!).
Set it to True if you want to run absolutely everything. Set it to False to skip optional steps/exploratory work.
def run_entire_notebook(filename: str = None):
run_all = False
if not run_all:
print("skipping optional operation")
fullpath = f"cached/printouts/{filename}.txt"
if filename is not None and os.path.exists(fullpath):
print("==== 🗃️ printing cached output ====")
with open(fullpath) as f:
print(f.read())
return run_all
We will also create a small settings file to store all of the tweaks/configs we might need in production (train/fit/deployed model), so we can guarantee that we rely on this file instead of having to remember values and risk misconfiguring something.
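`ProdSettings` is one of our project utilities; purely to illustrate the idea, here is a minimal sketch of what a JSON-backed settings store like it could look like (the method names and defaults here are assumptions, not the real API):

```python
# Minimal sketch of a JSON-backed settings store (hypothetical API).
import json
import os


class ProdSettingsSketch:
    """Tiny persistent key/value store for production tweaks/configs."""

    def __init__(self, filepath: str):
        self.filepath = filepath
        self._settings = {}
        if os.path.exists(filepath):
            with open(filepath) as f:
                self._settings = json.load(f)

    def get(self, key, default=None):
        return self._settings.get(key, default)

    def set(self, key, value):
        # Persist immediately so a crashed notebook never loses config
        self._settings[key] = value
        with open(self.filepath, "w") as f:
            json.dump(self._settings, f, indent=2)
```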
prod_settings = ProdSettings("prod/settings.json")
Our business wants to help lenders by providing risk evaluation services.
We can leverage our skills as Data Scientists to build various models and evaluate their performance.
We know that this business domain (banking/loans/risk management) generally deals with imbalanced datasets, so some metrics are immediately discarded (accuracy is the first one to go out the window!).
There is the temptation to use generic metrics that work "well enough" for imbalanced datasets, but for this project we want to take a step back and look at the bigger picture:
We want to measure our success in the same way that our clients measure it.
That means that we should be familiar with their criteria for success.
We should use a custom metric. The reasoning is that it is unlikely they measure success "with an f1_score() function".
If our client were a regular bank, they would be more likely to measure success based on overall cash balance (are we making enough money? are we being too risky with our bets and losing out? are we being too conservative and leaving money on the table? what about opportunity costs?)
But that's not our client. Our client Home Credit Group is an international non-bank financial institution founded in 1997 which operates in 9 countries and focuses on installment lending primarily to people with little or no credit history.
Kaggle explains that Home Credit:
strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience.
If we want to be good at helping Home Credit Group, we need to start thinking like Home Credit Group!
Based on the description on Kaggle, it seems that they care less about being extremely highly profitable and more about actually helping people and uplifting communities.
We should use a metric that reflects these values. Our task as Data Scientists is to encode these values and tell the predictive models to measure success the way our clients measure it, so that when we optimize and tune the models, they take these guiding principles into account.
Looking at the data, it's pretty obvious that this project is a great place to use machine learning to make predictions on the loans.
There is a temptation to think that we can just brute force this. For example, we could think: "If we use various scoring metrics and then train various models using each of these metrics, we can see which one gets better scores and pick that metric and model. The slides will be fantastic!"... However, this is a trap.
We will try to explain why, using an analogy:
Suppose we are creating a "Personal Shopper bot" that will help clients update their wardrobe. A generic approach might be to choose clothes that are in fashion and high-quality. However, this might not align with the client's personal style or comfort. Maybe they are looking for comfy stay-at-home clothes!
It would be incorrect to create various personal shopper bots, train each of them on different criteria (pick only fashionable clothes, pick only high-quality clothes), and then compare the bots on how they perform according to each of those specific functions. These scores tell us nothing about how things will work in the real world (what do actual customers need and value?). They only tell us how well each model fits the data according to that function, but it does not mean that real customers will actually be guided by any of these functions.
This is what such a situation could look like:

I hope this picture illustrates why Model 5 is the correct model to use: it's the one with the highest score according to the criteria that our customer actually uses.
For this reason, we will need to make a judgement call right now, as part of this project's storytelling: We need to find out (= "decide") what our company values.
We want to make as many correct guesses as possible, but we also want to recognize that some errors are worse than others...
That is:
| | predicted 0 - "no struggle" | predicted 1 - "struggle" |
|---|---|---|
| actual 0 - "no struggle" | ➕ ✅ GOOD | ➖ ❌ NOT GREAT |
| actual 1 - "struggle" | ➖ ❌ TERRIBLE | ➕ ✅ GOOD |
False negatives (predict "no struggle" but the customer did struggle to repay) are worse than the opposite (False Positives).
We are using this terminology to match the criteria used in the TARGET column:
- 0 = no difficulties
- 1 = had difficulties
While we want to use different weights for the two error types, we want to make sure that we don't accidentally make FN infinitely worse than FP, as this would result in a model that leans too hard in that direction.
We want to use the right weights as a tool to help us improve our model so it scores best (as many correct guesses as possible).
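To make this concrete, here is a minimal sketch of what such a weighted custom metric could look like (the 5x FN weight is a placeholder for illustration, not the final business value):

```python
# Sketch of a custom metric that penalizes false negatives more than false
# positives. Weights are placeholders; the real values are a business decision.
from sklearn.metrics import confusion_matrix, make_scorer


def weighted_error_score(y_true, y_pred, fn_weight=5.0, fp_weight=1.0):
    """Score in [0, 1]; higher is better, and FN errors cost more than FP."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    weighted_errors = fn_weight * fn + fp_weight * fp
    total = tn + fp + fn + tp
    # Normalize by the worst possible penalty so a perfect model scores 1.0
    return 1.0 - weighted_errors / (fn_weight * total)


# Usable anywhere sklearn expects a scorer (e.g. GridSearchCV(scoring=...))
custom_scorer = make_scorer(weighted_error_score, greater_is_better=True)
```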
So:
Because this is a serious enterprise, we don't want to deploy a model that performs poorly. But we also want to recognize that this is iteration 1.0 for this new finance division, so we will set a goal that is ambitious but also realistic.
PS: Spoiler - We don't know how this project will go, but since it's a learning project and it requires a deployment to Google Cloud, we will ultimately deploy whatever best models we can make, even if they do not reach our target. But in a real business environment, we would be doing this at this stage: engaging the business decision-makers and deciding the go/no-go cut off before we even begin with the technical work.
Now that we have defined our criteria for success, we can finally get started!
💡 If you don't have credentials for kaggle, you can just download the dataset from the links above, in the Context-Requirements section
competition_name = "home-credit-default-risk"
db_filename = "HomeCredit_columns_description.csv"
auto_kaggle.download_dataset(
competition_name, db_filename, timeout_seconds=120, is_competition=True
)
__
Kaggle API 1.5.13 - login as 'edualmas' File [dataset/HomeCredit_columns_description.csv] already exists locally! No need to re-download data for competition [home-credit-default-risk]
We want to get a general understanding of the data that we have been provided:
- [x] Get an understanding of the data we have (what is in each `application_*` table?)
- [x] Format data so we can easily retrieve it later

Let's rename and standardize the names so they all use lowercase snake_case, across all datasets and fields.
if run_entire_notebook():
!mv dataset/POS_CASH_balance.csv dataset/pos_cash_balance.csv
skipping optional operation
all_csvs = {
"bureau_balance.csv": "ascii",
"credit_card_balance.csv": "ascii",
"installments_payments.csv": "ascii",
"previous_application.csv": "ascii",
"application_train.csv": "ascii",
"application_test.csv": "ascii",
"bureau.csv": "ascii",
"pos_cash_balance.csv": "ascii",
}
if run_entire_notebook("lower_case_field_names"):
with fs_utils.in_subdir("dataset"):
for csvfile in all_csvs:
print(f"converting headers to lower case:", csvfile)
output_filename = "lower_" + csvfile
command = f"awk 'NR==1 {{print tolower($0)}} NR!=1 {{print $0}}' {csvfile} > {output_filename}"
# run awk first, then replace the original file with the lower-cased output
subprocess.run(command, shell=True, check=True)
!mv $output_filename $csvfile
skipping optional operation ==== 🗃️ printing cached output ==== converting headers to lower case: bureau_balance.csv converting headers to lower case: credit_card_balance.csv converting headers to lower case: installments_payments.csv converting headers to lower case: previous_application.csv converting headers to lower case: application_train.csv converting headers to lower case: application_test.csv converting headers to lower case: bureau.csv converting headers to lower case: pos_cash_balance.csv
Let's import all of the data to a sqlite database, so we can query it easily
def import_all_csvs_into_sqlite():
if os.path.exists("dataset/db.sqlite"):
print("deleting existing file: dataset/db.sqlite")
!rm dataset/db.sqlite
for csv in map(lambda x: f"dataset/{x}", all_csvs.keys()):
print(f"importing {csv}")
!csvs-to-sqlite $csv dataset/db.sqlite
if run_entire_notebook("merge_csvs"):
import_all_csvs_into_sqlite()
skipping optional operation ==== 🗃️ printing cached output ==== importing dataset/application_test.csv Loaded 1 dataframes Created dataset/db.sqlite from 1 CSV file importing dataset/bureau_balance.csv Loaded 1 dataframes Added 1 CSV file to dataset/db.sqlite importing dataset/credit_card_balance.csv Loaded 1 dataframes Added 1 CSV file to dataset/db.sqlite importing dataset/installments_payments.csv Loaded 1 dataframes Added 1 CSV file to dataset/db.sqlite importing dataset/previous_application.csv Loaded 1 dataframes Added 1 CSV file to dataset/db.sqlite importing dataset/application_train.csv Loaded 1 dataframes Added 1 CSV file to dataset/db.sqlite importing dataset/bureau.csv Loaded 1 dataframes Added 1 CSV file to dataset/db.sqlite importing dataset/POS_CASH_balance.csv Loaded 1 dataframes Added 1 CSV file to dataset/db.sqlite
This is the ER diagram that we can see in the official documentation.
It seems that the 2 main tables are application_train and application_test, and that these two act as the central table in a pseudo star/snowflake schema (depending on how picky we want to get with the definitions).
A few of the key assumptions/understandings we have right now is:
From the docs:
- application_{train|test}.csv
- This is the main table, broken into two files for Train (with TARGET) and Test (without TARGET).
- Static data for all applications. One row represents one loan in our data sample.
We will validate our assumptions soon

@cached_dataframe()
def eda_check_all_tables_have_data():
all_tables = [
"application_train",
"application_test",
"bureau",
"bureau_balance",
"credit_card_balance",
"installments_payments",
"previous_application",
"pos_cash_balance",
]
with sql_utils.AutoClosing("dataset/db.sqlite") as original_dataset_conn:
summary = {}
for table in all_tables:
c = original_dataset_conn.cursor()
q = f"SELECT count(*) FROM {table}"
c.execute(q)
rowcount = c.fetchone()[0]
summary[table] = rowcount
return pd.DataFrame(summary.values(), index=summary.keys(), columns=["row_count"])
rowcounts = eda_check_all_tables_have_data()
number_of_tables_with_0_rows = rowcounts[rowcounts == 0].count()
assert number_of_tables_with_0_rows.sum() == 0
display(rowcounts)
Loading from cache [./cached/df/eda_check_all_tables_have_data.parquet]
| row_count | |
|---|---|
| application_train | 307511 |
| application_test | 48744 |
| bureau | 1716428 |
| bureau_balance | 27299925 |
| credit_card_balance | 3840312 |
| installments_payments | 13605401 |
| previous_application | 1670214 |
| POS_CASH_balance | 10001358 |
Considering that this is a capstone project, we already know that this dataset will be imbalanced.
The question is "how terrible" will the imbalance be?
@run
@cached_with_pickle()
def eda_query_target_balance():
with sql_utils.AutoClosing("dataset/db.sqlite") as original_dataset_conn:
original_dataset_conn = sql_utils.Sql(original_dataset_conn)
q = "SELECT target, count(*) FROM application_train GROUP BY target"
return original_dataset_conn.peek("", q, limit=15)
Loading from cache [./cached/pickle/eda_query_target_balance.pickle]
| TARGET | count(*) | |
|---|---|---|
| 0 | 0 | 282686 |
| 1 | 1 | 24825 |
⚠️ This dataset is quite imbalanced!
We must keep this imbalance (~11-to-1) in mind during the rest of the steps.
2-to-1 would be noteworthy; >10-to-1 is going to be a challenge to handle delicately.
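A quick sanity check of that ratio from the counts above:

```python
# Imbalance ratio from the TARGET counts queried above
majority, minority = 282686, 24825
print(f"imbalance ratio: {majority / minority:.1f}-to-1")  # ~11.4-to-1
```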
The last thing we want to check is the first assumption we made: "there is no overlap between train/test IDs".
Since these IDs are used to relate info across various tables, we expect 0 rows to show up below. If anything shows up, we have a problem with a potentially corrupted dataset or a malformed train/test split:
@cached_with_pickle()
def eda_query_train_test_overlap():
with sql_utils.AutoClosing("dataset/db.sqlite") as original_dataset_conn:
original_dataset_conn = sql_utils.Sql(original_dataset_conn)
q = (
"SELECT SK_ID_CURR FROM application_train"
" INTERSECT "
"SELECT SK_ID_CURR FROM application_test"
)
return original_dataset_conn.peek("", q, limit=15)
data_leakage_rowcount = eda_query_train_test_overlap()
display(data_leakage_rowcount)
print("We expect this table to be empty (0 rows)", end=": ")
util.check(data_leakage_rowcount.shape[0] == 0)
Loading from cache [./cached/pickle/eda_query_train_test_overlap.pickle]
| SK_ID_CURR |
|---|
We expect this table to be empty (0 rows): ✅
def check_no_data_overlaps(conn, tables=None):
if not tables:
tables = ["training", "val", "test", "hyper"]
dfcols = {}
for t1 in tables:
results = []
for t2 in tables:
sql = f"""
SELECT count(*)
FROM (
SELECT SK_ID_CURR
FROM APPLICATION_{t1}
INTERSECT
SELECT SK_ID_CURR
FROM APPLICATION_{t2}
);
"""
cursor = conn.cursor()
cursor.execute(sql)
result = cursor.fetchone()[0]
results.append(result)
dfcols[t1] = results
return pd.DataFrame(dfcols, index=tables)
@cached_with_pickle()
def eda_data_overlap_original_dataset():
with sql_utils.AutoClosing("dataset/db.sqlite") as original_dataset_conn:
return check_no_data_overlaps(original_dataset_conn, ["TRAIN", "TEST"])
display(eda_data_overlap_original_dataset())
Loading from cache [./cached/pickle/eda_data_overlap_original_dataset.pickle]
| TRAIN | TEST | |
|---|---|---|
| TRAIN | 307511 | 0 |
| TEST | 0 | 48744 |
Let's remember this 307511, because we will need it later to make sure no data gets lost!
It seems that there is no overlap between the TRAIN and the TEST datasets.
This means that if we split the data using SK_ID_CURR from the right table, we are guaranteed not to have any piece of data out of place.
This dataset contains 2 distinct Application tables: application_train and application_test.
Since this data comes from a Kaggle Competition, the test split does not contain any TARGET column, which means we cannot actually use it for testing how our model performs on unseen data.
Normally, it'd be fun to try to submit our predictions to kaggle to see how we did. However, even this is not available to us. Since the competition is from a few years ago, it's already closed and it does not allow new predictions to be submitted/scored.
This means that instead of using the Kaggle-provided split for our train/test, we will likely have to drop the test partition and create sub-datasets ourselves from the train split.
def generate_delete_statement(table_name, valid_tablename):
if table_name == "bureau_balance":
return """DELETE FROM bureau_balance
WHERE bureau_balance.SK_ID_BUREAU NOT IN (
SELECT bureau.SK_ID_BUREAU FROM bureau
);"""
else:
return f"""DELETE FROM {table_name}
WHERE SK_ID_CURR NOT IN (SELECT SK_ID_CURR FROM {valid_tablename});"""
def drop_data(conn: sqlite3.Connection, split_name: str) -> None:
valid_tablename = f"application_{split_name}"
secondary_tables = [
"bureau",
"bureau_balance",
"credit_card_balance",
"installments_payments",
"previous_application",
"pos_cash_balance",
]
sql_statements = {
table: generate_delete_statement(table, valid_tablename)
for table in secondary_tables
}
for table, sql in sql_statements.items():
print(f"Deleting unnecessary data from... {table}: ", end="")
cursor = conn.cursor()
cursor.execute(sql)
rows = cursor.rowcount
print(f"{rows:,} rows deleted")
all_partitions = [
"application_hyper",
"application_val",
"application_test",
"application_training",
]
for tb in set(all_partitions) - {valid_tablename}:
q = f"DROP TABLE IF EXISTS {tb};"
print(q)
cursor.execute(q)
conn.commit()
print("vacuuming db")
conn.execute("VACUUM")
print("vacuum completed")
def create_db(split_name: str) -> None:
print(f"Moving all data for {split_name} into its own table")
filename = f"dataset/split_{split_name}.sqlite"
!cp dataset/db.sqlite $filename
with sql_utils.AutoClosing(filename) as conn:
drop_data(conn, split_name)
print(f"Completed. The database [{filename}] only contains {split_name} data")
if run_entire_notebook("delete_test_split"):
create_db("train")
!rm "dataset/db.sqlite"
!mv "dataset/split_train.sqlite" "dataset/db.sqlite"
skipping optional operation ==== 🗃️ printing cached output ==== Moving all data for train into its own table Deleting unnecessary data from... bureau: 251,103 rows deleted Deleting unnecessary data from... bureau_balance: 12,598,313 rows deleted Deleting unnecessary data from... credit_card_balance: 612,347 rows deleted Deleting unnecessary data from... installments_payments: 0 rows deleted Deleting unnecessary data from... previous_application: 256,513 rows deleted Deleting unnecessary data from... POS_CASH_balance: 1,457,983 rows deleted DROP TABLE IF EXISTS application_hyper; DROP TABLE IF EXISTS application_val; DROP TABLE IF EXISTS application_training; DROP TABLE IF EXISTS application_test; vacuuming db vacuum completed Completed. The database [dataset/split_train.sqlite] only contains train data
Now that we have removed all the data that we were not going to be able to use, we need to create new data splits.
Not only do we need new train and test splits, we will also need a couple more independent datasets for this project:
- `training` dataset: for most of the model training
- `hyper` dataset: for hyperparameter tuning
- `val` dataset: a validation set to perform validations on unseen data
- `test` dataset: a final test split to make the final deploy/kill decision for the selected model

We will use a custom function to create these 4 splits from the data that was originally marked as training.
Let's create some utility functions that we will need
def check_no_data_got_lost(conn):
cte = """
WITH new_tables AS ( SELECT * FROM application_training
UNION ALL
SELECT * FROM application_hyper
UNION ALL
SELECT * FROM application_val
UNION ALL
SELECT * FROM application_test ),
original_table AS ( SELECT * FROM application_original )
"""
check_1 = f"""{cte} SELECT * FROM new_tables EXCEPT SELECT * FROM original_table;"""
check_2 = f"""{cte} SELECT * FROM original_table EXCEPT SELECT * FROM new_tables;"""
cursor = conn.cursor()
cursor.execute(check_1)
result1 = cursor.fetchone()
cursor.execute(check_2)
result2 = cursor.fetchone()
if result1:
print(
"the new tables have more data than the original table (check for possible data corruption)"
)
if result2:
print(
"the new tables have less data than the original table (some data has been lost)"
)
return not (result1 or result2)
def prepare_db_for_split(cursor):
cursor.execute("DROP TABLE IF EXISTS application_training;")
cursor.execute("DROP TABLE IF EXISTS application_hyper;")
cursor.execute("DROP TABLE IF EXISTS application_val;")
cursor.execute("DROP TABLE IF EXISTS application_test;")
q = "SELECT name FROM sqlite_master WHERE type='table' AND name='application_train';"
cursor.execute(q)
print(q)
original_exists = cursor.fetchone()
print("original_exists:", original_exists)
if original_exists:
q = "ALTER TABLE application_train RENAME TO application_original;"
print(q)
cursor.execute(q)
Then we can split our data into 4 parts, ensuring that the data is randomly shuffled, that no row is accidentally reused, and that none is accidentally left out. Every row will end up in 1 split, and 1 split only.
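The splitting itself is done by `split_utils.split_dataset` (a project utility). As an illustration of the behavior we rely on, here is a minimal sketch of how such a stratified multi-way split could be implemented with scikit-learn (the function name, signature and return shape below are assumptions mirroring how the utility is called):

```python
# Sketch of a stratified multi-way split (hypothetical re-implementation).
from sklearn.model_selection import train_test_split


def split_dataset_sketch(df, target_col, fractions, random_state=100):
    """Split df into named stratified partitions; fractions must sum to 1."""
    splits = {}
    remaining = df
    remaining_frac = 1.0
    items = list(fractions.items())
    for name, frac in items[:-1]:
        # Take `frac` of the ORIGINAL data out of whatever is still left
        part, remaining = train_test_split(
            remaining,
            train_size=frac / remaining_frac,
            stratify=remaining[target_col],
            random_state=random_state,
        )
        remaining_frac -= frac
        splits[name] = (part.drop(columns=target_col), part[target_col])
    # The last partition is simply whatever remains
    last_name = items[-1][0]
    splits[last_name] = (remaining.drop(columns=target_col), remaining[target_col])
    return splits
```

Stratifying on TARGET is what guarantees that all 4 partitions keep the same ~8% positive rate (we verify this further down).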
if run_entire_notebook("split_into_4_datasets"):
with sql_utils.AutoClosing("dataset/db.sqlite") as original_db_conn:
cursor = original_db_conn.cursor()
prepare_db_for_split(cursor)
tr = pd.read_sql("SELECT * FROM application_original;", original_db_conn)
splits = split_utils.split_dataset(
tr, "TARGET", {"training": 0.7, "hyper": 0.1, "val": 0.1, "test": 0.1}
)
for name, (_x, _y) in splits.items():
print(f"storing {name} split. {_x.shape[0]} rows")
_df = _x.copy()
_df["TARGET"] = _y
cols = list(_df.columns)
cols = cols[:-1]
cols.insert(1, _df.columns[-1])
_df = _df[cols]
_df.to_sql(f"application_{name}", original_db_conn, index=False)
if check_no_data_got_lost(original_db_conn):
cursor.execute("DROP TABLE IF EXISTS application_original;")
print("all data has been accounted for ✅")
else:
print(
"something failed. skipping deleting original data so it can be used for inspection"
)
skipping optional operation
==== 🗃️ printing cached output ====
SELECT name FROM sqlite_master WHERE type='table' AND name='application_train';
original_exists: ('application_train',)
ALTER TABLE application_train RENAME TO application_original;
storing training split. 215258 rows
storing hyper split. 30751 rows
storing val split. 30751 rows
storing test split. 30751 rows
all data has been accounted for ✅
Alright! Things look good.
Let's compare the data with the original dataset, to make sure that we didn't lose anything. Recall what the original data looked like, and compare it to the 4-way split we just created from the TRAIN data (we dropped the TEST data because it was unusable to us):
@cached_dataframe()
def eda_check_no_overlap():
with sql_utils.AutoClosing("dataset/db.sqlite") as original_db_conn:
return check_no_data_overlaps(original_db_conn)
display(eda_check_no_overlap())
Loading from cache [./cached/df/eda_check_no_overlap.parquet]
| training | val | test | hyper | |
|---|---|---|---|---|
| training | 215258 | 0 | 0 | 0 |
| val | 0 | 30751 | 0 | 0 |
| test | 0 | 0 | 30751 | 0 |
| hyper | 0 | 0 | 0 | 30751 |
train_data_from_original_dataset = eda_data_overlap_original_dataset()["TRAIN"].sum()
all_data_after_split = eda_check_no_overlap().sum().sum()
util.check(train_data_from_original_dataset == all_data_after_split)
Loading from cache [./cached/pickle/eda_data_overlap_original_dataset.pickle] Loading from cache [./cached/df/eda_check_no_overlap.parquet] ✅
Now that we have all the data split into 4 tables, we can create separate files so we can just load the data that we need (and avoid peeking in the wrong place)
if run_entire_notebook("split_each_dataset_to_its_own_db"):
create_db("hyper")
create_db("val")
create_db("test")
create_db("training")
skipping optional operation ==== 🗃️ printing cached output ==== Moving all data for hyper into its own table Deleting unnecessary data from... bureau: 1,319,062 rows deleted Deleting unnecessary data from... bureau_balance: 13,226,798 rows deleted Deleting unnecessary data from... credit_card_balance: 2,905,033 rows deleted Deleting unnecessary data from... installments_payments: 0 rows deleted Deleting unnecessary data from... previous_application: 1,272,436 rows deleted Deleting unnecessary data from... POS_CASH_balance: 7,690,785 rows deleted DROP TABLE IF EXISTS application_val; DROP TABLE IF EXISTS application_training; DROP TABLE IF EXISTS application_test; vacuuming db vacuum completed Completed. The database [dataset/split_hyper.sqlite] only contains hyper data Moving all data for val into its own table Deleting unnecessary data from... bureau: 1,317,865 rows deleted Deleting unnecessary data from... bureau_balance: 13,201,811 rows deleted Deleting unnecessary data from... credit_card_balance: 2,907,762 rows deleted Deleting unnecessary data from... installments_payments: 0 rows deleted Deleting unnecessary data from... previous_application: 1,271,947 rows deleted Deleting unnecessary data from... POS_CASH_balance: 7,682,249 rows deleted DROP TABLE IF EXISTS application_hyper; DROP TABLE IF EXISTS application_training; DROP TABLE IF EXISTS application_test; vacuuming db vacuum completed Completed. The database [dataset/split_val.sqlite] only contains val data Moving all data for test into its own table Deleting unnecessary data from... bureau: 1,318,854 rows deleted Deleting unnecessary data from... bureau_balance: 13,230,887 rows deleted Deleting unnecessary data from... credit_card_balance: 2,911,086 rows deleted Deleting unnecessary data from... installments_payments: 0 rows deleted Deleting unnecessary data from... previous_application: 1,274,191 rows deleted Deleting unnecessary data from... POS_CASH_balance: 7,698,680 rows deleted DROP TABLE IF EXISTS application_hyper; DROP TABLE IF EXISTS application_val; DROP TABLE IF EXISTS application_training; vacuuming db vacuum completed Completed. The database [dataset/split_test.sqlite] only contains test data Moving all data for training into its own table Deleting unnecessary data from... bureau: 440,194 rows deleted Deleting unnecessary data from... bureau_balance: 4,445,340 rows deleted Deleting unnecessary data from... credit_card_balance: 960,014 rows deleted Deleting unnecessary data from... installments_payments: 0 rows deleted Deleting unnecessary data from... previous_application: 422,529 rows deleted Deleting unnecessary data from... POS_CASH_balance: 2,558,411 rows deleted DROP TABLE IF EXISTS application_hyper; DROP TABLE IF EXISTS application_val; DROP TABLE IF EXISTS application_test; vacuuming db vacuum completed Completed. The database [dataset/split_training.sqlite] only contains training data
Now we can delete the intermediate tables.
The last thing we want to do is add some indexes to speed up our queries and joins.
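`sql_utils.Sql.create_index` is a project helper; under the hood it presumably issues plain SQLite DDL along these lines (the index naming convention here is an assumption):

```python
# What a create_index(schema, table, column) helper boils down to in SQLite
def create_index_sketch(conn, table, column):
    # IF NOT EXISTS makes the operation safe to re-run
    conn.execute(
        f"CREATE INDEX IF NOT EXISTS idx_{table}_{column} ON {table} ({column});"
    )
    conn.commit()
```

With that in mind, here is the actual index creation: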
if run_entire_notebook("create_indexes"):
for split in ["training", "hyper", "val", "test"]:
with sql_utils.AutoClosing(f"dataset/split_{split}.sqlite") as split_conn:
c = sql_utils.Sql(split_conn)
print("-------")
c.create_index("main", f"application_{split}", "SK_ID_CURR")
c.create_index("main", "bureau", "SK_ID_CURR")
c.create_index("main", "bureau", "SK_ID_BUREAU")
c.create_index("main", "bureau_balance", "SK_ID_BUREAU")
c.create_index("main", "credit_card_balance", "SK_ID_CURR")
c.create_index("main", "credit_card_balance", "SK_ID_PREV")
c.create_index("main", "installments_payments", "SK_ID_CURR")
c.create_index("main", "installments_payments", "SK_ID_PREV")
c.create_index("main", "installments_payments", "num_instalment_number")
c.create_index("main", "installments_payments", "num_instalment_version")
c.create_index("main", "pos_cash_balance", "SK_ID_CURR")
c.create_index("main", "pos_cash_balance", "SK_ID_PREV")
c.create_index("main", "previous_application", "SK_ID_CURR")
c.create_index("main", "previous_application", "SK_ID_PREV")
skipping optional operation ==== 🗃️ printing cached output ==== ------- creating index on main.application_training ['SK_ID_CURR']... done ✅ creating index on main.bureau ['SK_ID_CURR']... done ✅ creating index on main.bureau ['SK_ID_BUREAU']... done ✅ creating index on main.bureau_balance ['SK_ID_BUREAU']... done ✅ creating index on main.credit_card_balance ['SK_ID_CURR']... done ✅ creating index on main.credit_card_balance ['SK_ID_PREV']... done ✅ creating index on main.installments_payments ['SK_ID_CURR']... done ✅ creating index on main.installments_payments ['SK_ID_PREV']... done ✅ creating index on main.installments_payments ['num_instalment_number']... done ✅ creating index on main.installments_payments ['num_instalment_version']... done ✅ creating index on main.POS_CASH_balance ['SK_ID_CURR']... done ✅ creating index on main.POS_CASH_balance ['SK_ID_PREV']... done ✅ creating index on main.previous_application ['SK_ID_CURR']... done ✅ creating index on main.previous_application ['SK_ID_PREV']... done ✅ ------- creating index on main.application_hyper ['SK_ID_CURR']... done ✅ creating index on main.bureau ['SK_ID_CURR']... done ✅ creating index on main.bureau ['SK_ID_BUREAU']... done ✅ creating index on main.bureau_balance ['SK_ID_BUREAU']... done ✅ creating index on main.credit_card_balance ['SK_ID_CURR']... done ✅ creating index on main.credit_card_balance ['SK_ID_PREV']... done ✅ creating index on main.installments_payments ['SK_ID_CURR']... done ✅ creating index on main.installments_payments ['SK_ID_PREV']... done ✅ creating index on main.installments_payments ['num_instalment_number']... done ✅ creating index on main.installments_payments ['num_instalment_version']... done ✅ creating index on main.POS_CASH_balance ['SK_ID_CURR']... done ✅ creating index on main.POS_CASH_balance ['SK_ID_PREV']... done ✅ creating index on main.previous_application ['SK_ID_CURR']... done ✅ creating index on main.previous_application ['SK_ID_PREV']... done ✅ ------- creating index on main.application_val ['SK_ID_CURR']... done ✅ creating index on main.bureau ['SK_ID_CURR']... done ✅ creating index on main.bureau ['SK_ID_BUREAU']... done ✅ creating index on main.bureau_balance ['SK_ID_BUREAU']... done ✅ creating index on main.credit_card_balance ['SK_ID_CURR']... done ✅ creating index on main.credit_card_balance ['SK_ID_PREV']... done ✅ creating index on main.installments_payments ['SK_ID_CURR']... done ✅ creating index on main.installments_payments ['SK_ID_PREV']... done ✅ creating index on main.installments_payments ['num_instalment_number']... done ✅ creating index on main.installments_payments ['num_instalment_version']... done ✅ creating index on main.POS_CASH_balance ['SK_ID_CURR']... done ✅ creating index on main.POS_CASH_balance ['SK_ID_PREV']... done ✅ creating index on main.previous_application ['SK_ID_CURR']... done ✅ creating index on main.previous_application ['SK_ID_PREV']... done ✅ ------- creating index on main.application_test ['SK_ID_CURR']... done ✅ creating index on main.bureau ['SK_ID_CURR']... done ✅ creating index on main.bureau ['SK_ID_BUREAU']... done ✅ creating index on main.bureau_balance ['SK_ID_BUREAU']... done ✅ creating index on main.credit_card_balance ['SK_ID_CURR']... done ✅ creating index on main.credit_card_balance ['SK_ID_PREV']... done ✅ creating index on main.installments_payments ['SK_ID_CURR']... done ✅ creating index on main.installments_payments ['SK_ID_PREV']... done ✅ creating index on main.installments_payments ['num_instalment_number']... 
done ✅ creating index on main.installments_payments ['num_instalment_version']... done ✅ creating index on main.POS_CASH_balance ['SK_ID_CURR']... done ✅ creating index on main.POS_CASH_balance ['SK_ID_PREV']... done ✅ creating index on main.previous_application ['SK_ID_CURR']... done ✅ creating index on main.previous_application ['SK_ID_PREV']... done ✅
One final check, just to make sure that all our datasets are representative and similar:
if run_entire_notebook("split_check_imbalance_on_all_splits"):
for split in ["training", "hyper", "val", "test"]:
with sql_utils.AutoClosing(f"dataset/split_{split}.sqlite") as split_conn:
c = sql_utils.Sql(split_conn)
print(split)
print(
c.peek(
"",
f"""SELECT TARGET,
COUNT(*) AS count,
COUNT(*) * 100.0 / SUM(COUNT(*)) OVER () AS percentage
FROM application_{split}
GROUP BY TARGET""",
)
)
skipping optional operation ==== 🗃️ printing cached output ==== training TARGET count percentage 0 0 197880 91.926897 1 1 17378 8.073103 hyper TARGET count percentage 0 0 28269 91.928718 1 1 2482 8.071282 val TARGET count percentage 0 0 28268 91.925466 1 1 2483 8.074534 test TARGET count percentage 0 0 28269 91.928718 1 1 2482 8.071282
Each of the 4 parts shows a similar percentage (~8% positives). We are confident that the split was properly randomized and stratified.
Let's take a look at the data. We're particularly interested in:
- How are we going to analyze our data?
- Which dataset are we going to use? We will use the smallest split (`hyper`), which is 10% of the overall data and $1\over7$ the size of the training dataset.

From the data dictionary, we can extract:
Target variable
- 1 = client with payment "difficulties" (data dict contains more specifics)
- 0 = all other cases
In summary, 0 is the ideal result, and 1 indicates troubles.
This table is the primary table in our database. It contains the core features and target variable.
All other tables have foreign keys referencing this table's primary key.
eda_conn = sqlite3.connect("dataset/split_hyper.sqlite")
app_sample = pd.read_sql_query("SELECT * FROM application_hyper", eda_conn)
def balance(df: pd.DataFrame, target: str) -> pd.DataFrame:
rus = RandomUnderSampler(random_state=42)
X, y = rus.fit_resample(df.drop(columns=target), df[target])
return pd.concat([X, y], axis=1)
app_sample_balanced = balance(app_sample, "TARGET")
app_sample.head()
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 182534 | 0 | Cash loans | F | N | Y | 0 | 67500.0 | 332842.5 | 12676.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 1 | 226983 | 0 | Cash loans | M | N | N | 1 | 135000.0 | 354276.0 | 25803.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 176676 | 0 | Cash loans | F | Y | Y | 0 | 265500.0 | 1762110.0 | 46480.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 |
| 3 | 266436 | 0 | Cash loans | M | Y | Y | 0 | 90000.0 | 247500.0 | 16978.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 205215 | 1 | Cash loans | M | N | Y | 0 | 90000.0 | 417024.0 | 28341.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 122 columns
app_sample["TARGET"].value_counts()
0 28269 1 2482 Name: TARGET, dtype: int64
app_sample_balanced["TARGET"].value_counts()
0 2482 1 2482 Name: TARGET, dtype: int64
We want to take a quick look at our fields to understand which ones might be useful in helping predict the target variable.
We will be looking at the artificially balanced dataset (obviously). The 50% line is the baseline, anything closer to 0 indicates "better" (less difficulties) for our loan customers.
if run_entire_notebook("sweetviz_report"):
sv_report = sv.analyze(
app_sample_balanced, target_feat="TARGET", pairwise_analysis="off"
)
sv_report.show_html("cached/sweetviz_report.html")
display(IFrame("cached/sweetviz_report.html", width=1600, height=600))
skipping optional operation ==== 🗃️ printing cached output ==== Report cached/sweetviz_report.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.
A few things to highlight from this report:
Features that seem to noticeably influence TARGET:
- EXT_SOURCE_1, EXT_SOURCE_2, EXT_SOURCE_3 - very strongly! (the higher, the better, for all 3 scores)
- AMT_REQ_CREDIT_BUREAU_YEAR
- DAYS_BIRTH
- REGION_POPULATION_RELATIVE
- The document flags (FLAG_DOCUMENT_xx). When documents are provided, the risk decreases noticeably.
- DAYS_LAST_PHONE_CHANGE - how long ago they changed their phone (it's unclear if this means "changed the handset" or "changed the phone number").
- YEARS_BUILD_MEDI/MODE/AVG (and other fields about the building they live in)
- HOUR_APPR_PROCESS_START
- AMT_ANNUITY

This is a quick visual analysis. A bit further down, we will use statistical tools to try to validate these hints.
Probably the simplest way to understand our features is to use a filter-based feature selection technique called Missing Value Ratio. This calculation lets us see which features are more likely to be useful/useless for our model.
We will use some libraries as well as some custom-built methods to understand how our data is spread.
Further down, we will see and use more advanced methodologies to help us further with our feature selection.
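The Missing Value Ratio itself is just the share of NAs per column; for instance (the 60% drop threshold below is an arbitrary illustration, not a decision we are making here):

```python
# Share of missing values per column, highest first
missing_ratio = app_sample_balanced.isna().mean().sort_values(ascending=False)
print(missing_ratio.head(10))

# Columns above some threshold would become candidates for dropping
drop_candidates = missing_ratio[missing_ratio > 0.60].index.tolist()
print(f"{len(drop_candidates)} columns exceed the 60% threshold")
```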
app_sample_balanced
| SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | ... | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | TARGET | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 189708 | Cash loans | M | N | Y | 0 | 135000.0 | 592560.0 | 31153.5 | 450000.0 | ... | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0 |
| 1 | 290477 | Cash loans | M | N | N | 0 | 225000.0 | 315000.0 | 33075.0 | 315000.0 | ... | 0 | 0 | 0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 3.0 | 0 |
| 2 | 333630 | Cash loans | F | Y | Y | 0 | 148500.0 | 518562.0 | 36220.5 | 463500.0 | ... | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0 |
| 3 | 244267 | Cash loans | F | N | Y | 1 | 157500.0 | 781920.0 | 34573.5 | 675000.0 | ... | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 4.0 | 0 |
| 4 | 326845 | Cash loans | F | N | Y | 0 | 76500.0 | 663669.0 | 22063.5 | 504000.0 | ... | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4959 | 313757 | Cash loans | M | N | Y | 0 | 157500.0 | 125640.0 | 8626.5 | 90000.0 | ... | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | 1 |
| 4960 | 281030 | Cash loans | F | Y | N | 0 | 144000.0 | 1272888.0 | 37345.5 | 1111500.0 | ... | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | 1 |
| 4961 | 340401 | Cash loans | M | Y | Y | 0 | 90000.0 | 1390500.0 | 39042.0 | 1390500.0 | ... | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | 1 |
| 4962 | 198003 | Cash loans | F | N | N | 0 | 144000.0 | 1325475.0 | 56160.0 | 1125000.0 | ... | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1 |
| 4963 | 254465 | Cash loans | F | N | Y | 0 | 135000.0 | 942300.0 | 27679.5 | 675000.0 | ... | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 1 |
4964 rows × 122 columns
def plot_missingnos_comparison(
df: pd.DataFrame, target_col: str, color=None, sort_first=True
):
if color is None or len(color) != 3:
color = (0.25, 0.25, 0.25)
f, ax = plt.subplots(1, 1, figsize=(40, 6))
target_zero = df[df[target_col] == 0]
target_one = df[df[target_col] == 1]
target_zero_nas = target_zero.isna().sum() > 0
target_one_nas = target_one.isna().sum() > 0
target_zero_nas = target_zero_nas[target_zero_nas > 0]
target_one_nas = target_one_nas[target_one_nas > 0]
cols_to_display = set(target_zero_nas.index).union(set(target_one_nas.index))
target_zero = target_zero[cols_to_display]
target_one = target_one[cols_to_display]
target_zero.columns = target_zero.columns.map(lambda n: f"{n}_target_0")
target_one.columns = target_one.columns.map(lambda n: f"{n}_target_1")
if sort_first:
zero_sorted = msno.nullity_sort(target_zero, sort="descending").reset_index(
drop=True
)
one_sorted = msno.nullity_sort(target_one, sort="descending").reset_index(
drop=True
)
else:
zero_sorted = msno.nullity_sort(target_zero, sort=None).reset_index(drop=True)
one_sorted = msno.nullity_sort(target_one, sort=None).reset_index(drop=True)
merged = zero_sorted.join(one_sorted)
sorted_cols = sorted(merged.columns)
msno.matrix(merged[sorted_cols], labels=True, ax=ax, sparkline=False, color=color)
chart_utils.rotate_x_labels(ax)
return f.gca()
@run
@cached_chart()
def eda_missing_no_matrix_comparison_target():
return plot_missingnos_comparison(
app_sample_balanced, "TARGET", color=moonstone_rgb_n
)
Loading from cache [./cached/charts/eda_missing_no_matrix_comparison_target.png]
The missingno library lets us see that there are strong patterns across columns in where data is missing.
@run
@cached_chart()
def eda_missing_no_matrix_heatmap():
f, ax = plt.subplots(1, 2, figsize=(60, 20))
target_zero = app_sample_balanced[app_sample_balanced["TARGET"] == 0]
target_one = app_sample_balanced[app_sample_balanced["TARGET"] == 1]
target_zero_nas = target_zero.isna().sum() > 0
target_one_nas = target_one.isna().sum() > 0
target_zero_nas = target_zero_nas[target_zero_nas > 0]
target_one_nas = target_one_nas[target_one_nas > 0]
cols_to_display = set(target_zero_nas.index).union(set(target_one_nas.index))
target_zero = target_zero[cols_to_display]
target_one = target_one[cols_to_display]
msno.heatmap(target_zero, labels=True, ax=ax[0])
msno.heatmap(target_one, labels=True, ax=ax[1])
ax[0].set_title("TARGET = 0 (no difficulties)")
ax[1].set_title("TARGET = 1 (difficulties)")
plt.tight_layout()
return f
Loading from cache [./cached/charts/eda_missing_no_matrix_heatmap.png]
@run
@cached_chart()
def eda_missing_no_matrix_dendrogram():
f, ax = plt.subplots(1, 2, figsize=(30, 15))
target_zero = app_sample_balanced[app_sample_balanced["TARGET"] == 0]
target_one = app_sample_balanced[app_sample_balanced["TARGET"] == 1]
target_zero_nas = target_zero.isna().sum() > 0
target_one_nas = target_one.isna().sum() > 0
target_zero_nas = target_zero_nas[target_zero_nas > 0]
target_one_nas = target_one_nas[target_one_nas > 0]
cols_to_display = set(target_zero_nas.index).union(set(target_one_nas.index))
target_zero = target_zero[cols_to_display]
target_one = target_one[cols_to_display]
msno.dendrogram(target_zero, ax=ax[0])
msno.dendrogram(target_one, ax=ax[1])
ax[0].set_title("TARGET = 0 (no difficulties)")
ax[1].set_title("TARGET = 1 (difficulties)")
plt.tight_layout()
return f
Loading from cache [./cached/charts/eda_missing_no_matrix_dendrogram.png]
To keep the charts simpler, we are only including columns that are missing some data.
It seems that there is some pattern to the missing fields. The good news is that there seem to be very strong correlations in the missing values (the heatmap leans strongly towards +1), indicating that there are clear clusters of columns that behave similarly (missing data in one column can be used to confidently predict missing data in other columns).
The matrix and the dendrogram visualizations show the same patterns. I was looking for a clear way to distinguish and find differences between the two datasets. Unfortunately, they are pretty close.
It's hard to dig deeper without investing considerable amounts of time: considering that these charts come from the balanced dataset, the differences are only minor. For now, we will consider missing values as "interesting" but not distinct enough between the two target groups (target 0 and target 1).
The data in both splits seems to follow clear patterns and might even be "clusterable" somehow.
Something we could try (in the future!) would be to use a clustering algorithm (e.g. k-means, rather than the supervised KNN) to group these loans based on which groups of features are missing (e.g. one dimension could be "is it missing data related to LIVINGAPARTMENTS_*", another dimension would be "is it missing data related to AMT_REQ_CREDIT_*").
The following features show scattered/unpredictable NAs that don't follow the patterns of the rest of the columns:
- OWN_CAR_AGE
- OCCUPATION_TYPE
- EXT_SOURCE_1

We will mark those on the following chart in **orange**.
We use a simple bootstrap to understand the distribution of our NAs.
We're also interested in seeing whether there is much variation from one sample to the next, so we will plot the bar chart with 95% CIs.
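The chart below comes from `eda_utils.plot_nas_bootstrapped` (a project utility). For one column, the underlying bootstrap could look roughly like this (the resample count and the percentile CI method are assumptions):

```python
# Sketch: bootstrap a 95% CI for the NA ratio of a single column
import numpy as np


def bootstrap_na_ratio_ci(df, column, n_boot=1000, ci=95, random_state=100):
    rng = np.random.default_rng(random_state)
    is_na = df[column].isna().to_numpy()
    # Resample rows with replacement and record the NA ratio each time
    ratios = [
        rng.choice(is_na, size=len(is_na), replace=True).mean()
        for _ in range(n_boot)
    ]
    lo, hi = np.percentile(ratios, [(100 - ci) / 2, 100 - (100 - ci) / 2])
    return is_na.mean(), (lo, hi)


bootstrap_na_ratio_ci(app_sample_balanced, "EXT_SOURCE_1")
```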
@run
@cached_chart()
def eda_plot_nas_from_bootstrap():
marks = {
"grey": [
"DAYS_BIRTH",
"REGION_POPULATION_RELATIVE",
"AMT_REQ_CREDIT_BUREAU_YEAR",
"YEARS_BUILD_MEDI",
"YEARS_BUILD_AVG",
"YEARS_BUILD_MODE",
"AMT_ANNUITY",
"HOUR_APPR_PROCESS_START",
"DAYS_LAST_PHONE_CHANGE",
],
"lightgrey": [f"FLAG_DOCUMENT_{i}" for i in range(2, 22)],
"orange": [
"EXT_SOURCE_1",
"EXT_SOURCE_2",
"EXT_SOURCE_3",
"OWN_CAR_AGE",
"OCCUPATION_TYPE",
],
}
return eda_utils.plot_nas_bootstrapped(app_sample_balanced, marked_columns=marks)
Loading from cache [./cached/charts/eda_plot_nas_from_bootstrap.png]
It seems most of the fields that we identified are fully present; only a handful of them are missing some data.
Some of the things we want to try will be only mildly affected by the number of NAs, so this represents just a small issue.
Another simple way to understand our features is to use another filter-based feature selection technique called Information Gain, or Mutual Information.
Let's take a look at which fields might be helpful to us. Remember that mutual information does not include interactions between fields and just scores them on their individual contribution (their mutual dependence).
In our case, we will score each of the fields against our target field.
Note that this custom mutual_info() function makes some generalizations, so its results are indicative only (e.g. it auto-selects features with "few" unique values as "discrete" instead of continuous, it drops rows with NAs, etc.). This chart is only an initial exploration, and it only includes contributions from individual features: it does not take into account feature interactions, so "low scores" are not necessarily an indication that we should drop a feature without further analysis.
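`eda_utils.mutual_info` is a project utility; a plausible minimal re-implementation on top of scikit-learn could look like this (the discrete cutoff of 20 unique values and the NA handling are assumptions mirroring the caveats above):

```python
# Sketch of a mutual-information ranking against the target (hypothetical
# re-implementation of eda_utils.mutual_info for classification targets).
import pandas as pd
from sklearn.feature_selection import mutual_info_classif


def mutual_info_sketch(df, target_col, id_cols=(), cols_to_encode=()):
    data = df.drop(columns=list(id_cols)).dropna()  # simplistic NA handling
    # Integer-encode categoricals so the estimator can consume them
    for col in cols_to_encode:
        data[col] = data[col].astype("category").cat.codes
    X = data.drop(columns=target_col).select_dtypes("number")
    # Heuristic: treat low-cardinality columns as discrete features
    discrete = [X[c].nunique() < 20 for c in X.columns]
    scores = mutual_info_classif(
        X, data[target_col], discrete_features=discrete, random_state=100
    )
    return pd.Series(scores, index=X.columns).sort_values(ascending=False)
```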
@run
@cached_chart()
def eda_mutual_information():
cols_to_encode = [
"NAME_CONTRACT_TYPE",
"CODE_GENDER",
"FLAG_OWN_CAR",
"FLAG_OWN_REALTY",
"NAME_TYPE_SUITE",
"NAME_INCOME_TYPE",
"NAME_EDUCATION_TYPE",
"NAME_FAMILY_STATUS",
"NAME_HOUSING_TYPE",
"OCCUPATION_TYPE",
"WEEKDAY_APPR_PROCESS_START",
"ORGANIZATION_TYPE",
"FONDKAPREMONT_MODE",
"HOUSETYPE_MODE",
"WALLSMATERIAL_MODE",
"EMERGENCYSTATE_MODE",
]
f, ax = plt.subplots(1, 2, figsize=(15, 7), sharex=True)
mi_u = eda_utils.mutual_info(
app_sample,
"TARGET",
id_cols=["SK_ID_CURR"],
is_classification=True,
cols_to_encode=cols_to_encode,
)
mi_b = eda_utils.mutual_info(
app_sample_balanced,
"TARGET",
id_cols=["SK_ID_CURR"],
is_classification=True,
cols_to_encode=cols_to_encode,
)
plt.title("mutual information analysis (top 20 fields)")
ax[0].set_title("using original sample (imbalanced)")
ax[1].set_title("using balanced sample")
sns.barplot(y=mi_u.head(25).index, x=mi_u.head(25).values, color=moonstone, ax=ax[0])
sns.barplot(y=mi_b.head(25).index, x=mi_b.head(25).values, color=moonstone, ax=ax[1])
plt.tight_layout()
return f
Loading from cache [./cached/charts/eda_mutual_information.png]
Remember that Mutual Information tells us how two quantities interact in terms of uncertainty: the mutual information (MI) between two quantities is a measure of the extent to which knowledge of one reduces uncertainty about the other. Also remember that it is not bounded to a 0-1 scale (like correlation is), so these values are more useful as a comparison among themselves than as an absolute measure of how well a feature scores within some valid range.
And despite this, it seems that no feature has a score far ahead of the rest.
This might indicate that no single feature dominates the search space, and that we will probably need complex models that can capture the interactions among the various features.
We can see in some of the charts on the SweetViz report that a number of features are highly skewed, likely due to extreme outliers.
One of the key tasks we will want to perform is finding a strategy that helps us detect and remove outliers from any of the datasets.
We also want to visualize how our features are spread across the range of values.
In this section we want to:
Categorical variables are excluded (for now) from this specific wave of Feature Selection.
def read_eda(tablename: str) -> pd.DataFrame:
return auto_pandas.normalize_columns(
pd.read_sql(f"SELECT * FROM {tablename};", eda_conn)
)
def add_metadata_for_outliers(df: pd.DataFrame) -> pd.DataFrame:
"""
adds metadata indicating the min and max values if we were to ignore outliers.
The outliers are not removed, so they can still be plotted properly
"""
outl = clean.isolation_forest_outliers(
df.drop(
columns=["sk_id_prev", "sk_id_bureau", "sk_id_curr", "pk"], errors="ignore"
),
contamination=0.05,
return_outliers_only=False,
)
inliers = outl[~outl["predicted_outlier"]]
df.attrs["mins"] = inliers.min()
df.attrs["maxs"] = inliers.max()
return df
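`clean.isolation_forest_outliers` is one of our helpers; conceptually it should reduce to something like the sketch below (the median imputation matches what is described further down, and the `predicted_outlier` flag name matches how we consume it above):

```python
# Sketch of Isolation-Forest-based outlier flagging (hypothetical helper body)
from sklearn.ensemble import IsolationForest
from sklearn.impute import SimpleImputer


def isolation_forest_outliers_sketch(df, contamination=0.05, random_state=100):
    numeric = df.select_dtypes("number")
    # Impute NAs with the column median so the forest can score every row
    imputed = SimpleImputer(strategy="median").fit_transform(numeric)
    forest = IsolationForest(
        contamination=contamination, random_state=random_state
    ).fit(imputed)
    result = df.copy()
    result["predicted_outlier"] = forest.predict(imputed) == -1  # -1 = outlier
    return result
```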
@cached_dataframes()
def eda_dfs_with_enriched_mixmax_after_removing_outliers():
tables = {
"application": read_eda("application_hyper"),
"bureau": read_eda("bureau"),
"bureau_balance": read_eda("bureau_balance"),
"credit_card_balance": read_eda("credit_card_balance"),
"installments_payments": read_eda("installments_payments"),
"pos_cash_balance": read_eda("pos_cash_balance"),
"previous_application": read_eda("previous_application"),
}
return {name: add_metadata_for_outliers(df) for name, df in tables.items()}
tables = eda_dfs_with_enriched_mixmax_after_removing_outliers()
Loading from cache [./cached/df_dict/eda_dfs_with_enriched_mixmax_after_removing_outliers.h5]
def plot_min_max(ax: plt.Axes, df: pd.DataFrame, colname: str) -> None:
inliers_min = df.attrs["mins"][colname]
inliers_max = df.attrs["maxs"][colname]
ax.axhspan(inliers_min, inliers_max, color="green", alpha=0.08)
@run
@cached_chart()
def eda_dist_app():
return chart_utils.plot_all_features(
tables["application"], ax_callback=plot_min_max, color=moonstone
)
Loading from cache [./cached/charts/eda_dist_app.png]
@run
@cached_chart()
def eda_dist_h_previous_application():
return chart_utils.plot_all_features(
tables["previous_application"], ax_callback=plot_min_max, color=moonstone
)
Loading from cache [./cached/charts/eda_dist_h_previous_application.png]
@run
@cached_chart()
def eda_dist_bureau():
return chart_utils.plot_all_features(
tables["bureau"], ax_callback=plot_min_max, color=moonstone
)
Loading from cache [./cached/charts/eda_dist_bureau.png]
@run
@cached_chart()
def eda_dist_bureau_balance():
return chart_utils.plot_all_features(
tables["bureau_balance"], ax_callback=plot_min_max, color=moonstone
)
Loading from cache [./cached/charts/eda_dist_bureau_balance.png]
@run
@cached_chart()
def eda_dist_credit_balance():
return chart_utils.plot_all_features(
tables["credit_card_balance"],
ax_callback=plot_min_max,
color=moonstone,
)
Loading from cache [./cached/charts/eda_dist_credit_balance.png]
@run
@cached_chart()
def eda_dist_installments_payments():
return chart_utils.plot_all_features(
tables["installments_payments"], ax_callback=plot_min_max, color=moonstone
)
Loading from cache [./cached/charts/eda_dist_installments_payments.png]
@run
@cached_chart()
def eda_dist_pos_balance():
return chart_utils.plot_all_features(
tables["pos_cash_balance"], ax_callback=plot_min_max, color=moonstone
)
Loading from cache [./cached/charts/eda_dist_pos_balance.png]
In these charts, we can see how the values are distributed, for all of the fields in all of the datasets.
For fields that are numeric, we are plotting histograms and we're including a light-green band. This green band shows the min and max values in the dataset, AFTER removing outliers using Isolation Forests:
In the picture below we can see how some fields are severely affected, while others are minorly changed.
This is because not all features exhibit the same behaviour/outliers. The current cutoff points use a 0.05 contamination threshold, but this parameter can be changed if we want to tweak it further. Any missing values have been imputed with the median value of the column.

This gives us an idea of which columns contribute more to a row "feeling" like an outlier, and which ones do not. We plan on using the same strategy to prepare our training dataset, to help our model understand the core patterns in our data.
Remember that the outlier detection is run on the entire dataset, but that the charts only sample 20% of the data (to speed up the rendering), as this is only meant to give a general understanding of viability.
We will do the actual removal on the training dataset at a later stage.
This dataset is given to us already pretty clean (even normalized across time features using relative time for all of the individual loans).
Despite this, it still has a lot of potential for further Feature Engineering.
I wanted to learn from the best, so I did a quick scan of the winning team's Kaggle solution/discussion. I found some really interesting FE efforts, and I wanted to replicate a few of them as practice.
The first one, in particular, is quite mindblowing. They use 3 of the strongest features (EXT_SOURCE_1/2/3) to create a 3-dimensional space/cube, and use KNN to learn about patterns inside it. They use Target Mean Encoding to understand, for any point in that space, how well its 500 closest neighbors do at repaying their loans without issues. (Well, they actually use a 4th dimension, a tesseract instead of a cube, to also encode the ratio of credit vs annuity, but visualizing a 3D cube is a lot easier than a 4D tesseract, so allow me this creative simplification during this explanation.)
The way they explain it:
neighbors_target_mean_500: The mean TARGET value of the 500 closest neighbors of each row, where each neighborhood was defined by the three external sources and the credit/annuity ratio
This lets them do a very interesting version of Target Mean Encoding: using each row's closest neighbors, they convert 4 continuous features into a single encoded success ratio.
Fascinating!
Anyway, I wanted to try replicating this to practice my FE skills, but I also wanted to include 6 more versions of it, its simplified siblings, each using only a subset of the features.
These are the features I envisioned:
| feature | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | credit/annuity ratio |
|---|---|---|---|---|
| knn_500_tme_1 | ✅ | | | ✅ |
| knn_500_tme_2 | | ✅ | | ✅ |
| knn_500_tme_3 | | | ✅ | ✅ |
| knn_500_tme_12 | ✅ | ✅ | | ✅ |
| knn_500_tme_23 | | ✅ | ✅ | ✅ |
| knn_500_tme_13 | ✅ | | ✅ | ✅ |
| knn_500_tme_all | ✅ | ✅ | ✅ | ✅ |
All the features marked with ✅ are used for that new calculation, with .fillna(0) applied where necessary.
The reason I wanted to create the 6 simplified versions is that some of the features have a high number of NAs (particularly EXT_3), and I wanted to make sure that our model could still fall back on other versions of this information when one or two of the fields are missing (think of it as flattening the "cube" onto the plane of one of its sides). I imagine the winning team also tried this but likely didn't get good enough performance to mention it in their discussion, so I don't expect the 6 simplified versions to perform amazingly, and that's ok. This exercise is about practicing elegant ways to encode and engineer features.
I haven't decided whether my model will be given all 7 combinations, or only the subset of features we consider useful given the NAs of each row. E.g. if the row to predict only has features 1, 3, and credit/annuity, I could give it all 7 features, or just the best 3 (in this case: knn_500_tme_13, knn_500_tme_1 and knn_500_tme_3).
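To make the plan concrete, here is a minimal sketch of how the 7 variants could be generated (a hypothetical knn_target_mean helper on synthetic data, with a reduced neighbor count for the demo; the real run below uses our feateng utilities on the balanced application table):

```python
# Sketch: one KNN target-mean-encoded feature per column subset.
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

def knn_target_mean(df, cols, target="target", n_neighbors=50):
    """Mean TARGET over each row's nearest neighbors in the space spanned by cols."""
    X = df[cols].fillna(0).to_numpy()
    _, idx = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X).kneighbors(X)
    idx = idx[:, 1:]  # drop each row itself (its own nearest neighbor)
    return df[target].to_numpy()[idx].mean(axis=1)

rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.random((500, 4)),
    columns=["ext_source_1", "ext_source_2", "ext_source_3", "credit_annuity_ratio"],
)
demo["target"] = rng.integers(0, 2, 500)

ext = ["ext_source_1", "ext_source_2", "ext_source_3"]
subsets = [list(c) for r in (1, 2) for c in combinations(ext, r)] + [ext]
for subset in subsets:
    suffix = "all" if len(subset) == 3 else "".join(c[-1] for c in subset)
    demo[f"knn_500_tme_{suffix}"] = knn_target_mean(demo, subset + ["credit_annuity_ratio"])
```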
df_fe_knns = eda_utils.balance(tables["application"].copy(), "target")
df_fe_knns["credit_annuity_ratio"] = df_fe_knns["amt_credit"] / df_fe_knns["amt_annuity"]
df_fe_knns = df_fe_knns[
["ext_source_1", "ext_source_2", "ext_source_3", "credit_annuity_ratio", "target"]
]
df_fe_knns_nas = df_fe_knns.copy()
df_fe_knns_nas[["ext_source_1"]] = df_fe_knns_nas[["ext_source_1"]].fillna(0)
df_fe_knns_nas[["ext_source_2"]] = df_fe_knns_nas[["ext_source_2"]].fillna(0)
df_fe_knns_nas[["ext_source_3"]] = df_fe_knns_nas[["ext_source_3"]].fillna(0)
feateng.target_mean_encoding_using_knn(
df_fe_knns_nas,
"target",
["ext_source_1", "ext_source_2", "ext_source_3", "credit_annuity_ratio"],
)
Let's visualize this cube from each of the sides:

@run
@cached_chart()
def feateng_knn_visualize_target_spread():
chart_utils.visualize_3d_cube(
df_fe_knns_nas,
dimX="ext_source_1",
dimY="ext_source_2",
dimZ="ext_source_3",
target="target",
title="Spread of TARGET over pairs of EXT_SOURCE_*",
elevation=30,
rotation=30,
)
return plt.gca()
filename = "cached/charts/feateng_rotate_extsources.gif"
if not os.path.exists(filename):
chart_utils.spinning_3d_plot(
df_fe_knns_nas,
dimX="ext_source_1",
dimY="ext_source_2",
dimZ="ext_source_3",
target="target",
title="Spread of TARGET over pairs of EXT_SOURCE_*",
elevation=30,
rotation=30,
export_gif_path=filename,
)
else:
print("loading pre-render")
Let's check the file we created, to get a better understanding

There seems to be a pattern to the data.
Let's try to apply KNN to do target mean encoding, and also visualize it.
This renders the actual spread of TARGET; let's now visualize the same thing, but for the knn(500) mean instead of each row's own target.
Once again, we will use a subset of the features (3) so we can visualize it in space. For the actual encoding, we will use all 4 features.
# Since this dataset is a lot smaller than the training one, we use a proportionally smaller number of neighbors (the size ratio is 1/7)
knn500 = NearestNeighbors(n_neighbors=500 // 7, algorithm="ball_tree")
display(knn500)
extscores = ["ext_source_1", "ext_source_2", "ext_source_3"]
nbrs = knn500.fit(df_fe_knns_nas[extscores])
distances, indices = knn500.kneighbors(df_fe_knns_nas[extscores])
indices = indices[:, 1:]  # drop each row itself (its own nearest neighbor)
df_fe_knns_nas["neighbors_target_mean_knn"] = [
df_fe_knns_nas["target"].iloc[indices[i]].mean() for i in range(len(indices))
]
df_fe_knns_nas
@run
@cached_chart()
def feateng_knn_visualize_target_knn():
chart_utils.visualize_3d_cube(
df_fe_knns_nas,
dimX="ext_source_1",
dimY="ext_source_2",
dimZ="ext_source_3",
target="neighbors_target_mean_knn",
title="Spread of target_mean over pairs of EXT_SOURCE_*",
elevation=30,
rotation=30,
palette="PuOr",
)
return plt.gca()
filename = "cached/charts/feateng_rotate_extsources_knn.gif"
if not os.path.exists(filename):
chart_utils.spinning_3d_plot(
df_fe_knns_nas,
dimX="ext_source_1",
dimY="ext_source_2",
dimZ="ext_source_3",
target="neighbors_target_mean_knn",
title="Spread of target_mean over pairs of EXT_SOURCE_*",
elevation=30,
rotation=30,
palette="PuOr",
export_gif_path=filename,
)
else:
print("loading pre-render")
Let's check the file we created, to get a better understanding

The trend seems pretty obvious, but we're worried that the "flattened"/NA-filled values sitting on the zero plane will pollute the values that are genuinely close to them.
Since the relationship between these 3 features and the target looks quite simple (higher is better), and it's the same for all three, we're considering a smarter imputation: using whichever of the 3 features are present to impute the missing ones. As long as 1 of the features is present, we should be able to improve/enrich our data.
df_fe_knns = eda_utils.balance(tables["application"].copy(), "target")
df_fe_knns["credit_annuity_ratio"] = df_fe_knns["amt_credit"] / df_fe_knns["amt_annuity"]
df_fe_knns = df_fe_knns[
["ext_source_1", "ext_source_2", "ext_source_3", "credit_annuity_ratio", "target"]
]
extscores = ["ext_source_1", "ext_source_2", "ext_source_3"]
imputer = KNNImputer(n_neighbors=100)
df_fe_knns_imputed = df_fe_knns.copy()
df_fe_knns_imputed[extscores] = imputer.fit_transform(df_fe_knns[extscores])
df_fe_knns_imputed
knn500 = NearestNeighbors(n_neighbors=500 // 7, algorithm="ball_tree")
display(knn500)
extscores = ["ext_source_1", "ext_source_2", "ext_source_3"]
nbrs = knn500.fit(df_fe_knns_imputed[extscores])
distances, indices = knn500.kneighbors(df_fe_knns_imputed[extscores])
indices = indices[:, 1:]
df_fe_knns_imputed["neighbors_target_mean_knn"] = [
df_fe_knns_imputed["target"].iloc[indices[i]].mean() for i in range(len(indices))
]
@run
@cached_chart()
def feateng_knn_visualize_target_knn_imputed():
chart_utils.visualize_3d_cube(
df_fe_knns_imputed,
dimX="ext_source_1",
dimY="ext_source_2",
dimZ="ext_source_3",
target="neighbors_target_mean_knn",
title="Spread of target_mean over pairs of EXT_SOURCE_*",
elevation=70,
rotation=-30,
palette="PuOr",
)
return plt.gca()
filename = "cached/charts/feateng_rotate_extsources_knn_imputed.gif"
if not os.path.exists(filename):
chart_utils.spinning_3d_plot(
df_fe_knns_imputed,
dimX="ext_source_1",
dimY="ext_source_2",
dimZ="ext_source_3",
target="neighbors_target_mean_knn",
export_gif_path=filename,
title="Spread of target_mean over pairs of EXT_SOURCE_*",
elevation=30,
rotation=30,
palette="PuOr",
)
else:
print("loading pre-render")

It's pretty clear that this space can easily be sliced in two using a single plane. Based on what we're seeing, we could consider using an SVC, just to see how it performs.
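As a quick sanity check of that intuition, a linear classifier on the three imputed sources would look something like this (purely exploratory, not our final model):

```python
# Sketch: can a single plane separate the two classes reasonably well?
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

X = df_fe_knns_imputed[["ext_source_1", "ext_source_2", "ext_source_3"]]
y = df_fe_knns_imputed["target"].astype(int)
print(cross_val_score(LinearSVC(max_iter=5000), X, y, cv=5).mean())
```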
But let's not fool ourselves. Let's verify it:
target_mean_0 = df_fe_knns_imputed[df_fe_knns_imputed["target"] == 0][
"neighbors_target_mean_knn"
]
target_mean_1 = df_fe_knns_imputed[df_fe_knns_imputed["target"] == 1][
"neighbors_target_mean_knn"
]
@run
@cached_chart()
def eda_difference_between_target01_after_imputing():
plt.title("Histogram of the spread of values after imputing using knn500 (scaled)")
plot_zero = sns.histplot(target_mean_0, binwidth=0.03, label="target 0")
plot_one = sns.histplot(target_mean_1, binwidth=0.03, label="target 1")
plt.legend()
return plt.gca()
And let's also apply a bit of statistical inferential analysis:
stats.ttest_ind(target_mean_0, target_mean_1)
Since the p-value is lower than our significance level, we must reject the null hypothesis. There might be something to this imputation using KNN(500) (scaled).
However, remember that this is just the EDA/exploration for feature engineering, and that the actual Feature Engineering will use a 4D space, because it will also include the credit/annuity ratio. We don't expect the ratio to follow any patterns or correlate with the other 3 dimensions, but let's check, just to be extra sure:
filename = "cached/charts/feateng_rotate_extsources_knn_imputed_ca_ratio.gif"
if not os.path.exists(filename):
chart_utils.spinning_3d_plot(
df_fe_knns_imputed,
dimX="ext_source_1",
dimY="ext_source_2",
dimZ="ext_source_3",
target="credit_annuity_ratio",
export_gif_path=filename,
title="Spread of credit_annuity_ratio over pairs of EXT_SOURCE_*",
elevation=30,
rotation=30,
palette="PuOr",
)
else:
print("loading pre-render")

As expected, there are no discernible colour patterns when we colour by credit_annuity_ratio instead of target. This means that this new feature encodes new information, and using it might be beneficial.
We expect that doing the same thing on the training dataset, but using 4 dimensions instead of 3, will enrich it considerably.
For now, let's look at other ways to encode and include features:
For this learning project, we would like to capture the information in the fact tables, so our classifier has additional signal to work with. We could do things manually, or we could learn how to use a cool library that seems fitting.
The featuretools library provides an easy way to perform Deep Feature Synthesis (DFS), which seems to work promisingly well for hierarchical datasets like the one we have.
Instead of creating some of the features ourselves, we can use DFS to generate all of them and then assess how much each one contributes to our model's performance.
Before we get started, let's dump the supported types to spreadsheets so we are familiar with the transformations/aggregations it offers. Feel free to check them at your leisure if you are not familiar with DFS.
Since we plan on using dask later, the third column is also of particular interest.
ft.list_primitives().head()
if run_entire_notebook():
ft.list_primitives().to_csv("feateng/docs/ftdfs_primitives_docs.csv")
ft.list_logical_types().to_csv("feateng/docs/ftdfs_logical_types.csv")
ft.list_semantic_tags().to_csv("feateng/docs/ftdfs_semantic_tags.csv")
dask_primitives = ft.list_primitives()
dask_primitives = dask_primitives[dask_primitives["dask_compatible"]]
dask_primitives.to_csv("feateng/docs/ftdfs_primitives_docs__dask.csv")
Now that we are more familiar with the DFS datatypes, we need to prepare our data.
This preparation will allow the library to understand our data like we do, and to optimize the transformations/aggregations in ways that make sense.
This requires:
- a primary key for each DataFrame (so ft can find relationships between them)
- physical types, logical types and semantic tags for each column

| DataFrame | Physical Types | Logical Types | Semantic Tags |
|---|---|---|---|
| app | ✅ | ✅ | ✅ |
| prev | ✅ | ✅ | ✅ |
| bureau | ✅ | ✅ | ✅ |
| bureau_bal | ✅ | ✅ | ✅ |
| credit_bal | ✅ | ✅ | ✅ |
| pos_bal | ✅ | ✅ | ✅ |
| installments | ✅ | ✅ | ✅ |
All of these tasks will be repeated in the "production" steps. We will export/store all of the lessons we learn, so we can re-apply them identically to the other datasets (train/val/test).
In the cells below, you can see all of these tasks being done:
def set_primary_key(df, col: str):
df.attrs["pk_col_name"] = col
return df
def add_compound_key(df, col1: str, col2: str):
df["pk"] = df[col1].astype(str) + "__" + df[col2].astype(str)
df.attrs["pk_col_name"] = "pk"
return df
def autonumeric_pk(df):
df = df.reset_index(names="pk")
df.attrs["pk_col_name"] = "pk"
return df
Loading the data:
@cached_dataframes()
def eda_dfs_all_entities():
h_app = read_eda("application_hyper")
h_bureau = read_eda("bureau")
h_bureau_balance = read_eda("bureau_balance")
h_credit_balance = read_eda("credit_card_balance")
h_installments_payments = read_eda("installments_payments")
h_pos_balance = read_eda("pos_cash_balance")
h_prev = read_eda("previous_application")
h_app = set_primary_key(h_app, "sk_id_curr")
h_prev = set_primary_key(h_prev, "sk_id_prev")
h_bureau = set_primary_key(h_bureau, "sk_id_bureau")
h_bureau_balance = add_compound_key(
h_bureau_balance, "sk_id_bureau", "months_balance"
)
h_credit_balance = add_compound_key(h_credit_balance, "sk_id_prev", "months_balance")
h_pos_balance = add_compound_key(h_pos_balance, "sk_id_prev", "months_balance")
# has no de-facto unique (compound) column(s) we can use as PK, so we create our own
h_installments_payments = autonumeric_pk(h_installments_payments)
return {
"application": h_app,
"bureau": h_bureau,
"bureau_balance": h_bureau_balance,
"credit_card_balance": h_credit_balance,
"installments_payments": h_installments_payments,
"pos_cash_balance": h_pos_balance,
"previous_application": h_prev,
}
Creating an empty EntitySet that will contain the entire graph:
(es := ft.EntitySet("applications"))
Entityset: applications
DataFrames:
Relationships:
No relationships
Adding the dataframes into the EntitySet, along with their relationships
entities = eda_dfs_all_entities()
for df_name, df in entities.items():
index_col = df.attrs["pk_col_name"]
print("Adding", df_name, "using index =", index_col)
ftdfs_utils.attach_df_into_es(es, df, df_name, index_col, "feateng")
del entities
Loading from cache [./cached/df_dict/eda_dfs_all_entities.h5] Adding application using index = sk_id_curr Adding bureau using index = sk_id_bureau Adding bureau_balance using index = pk Adding credit_card_balance using index = pk Adding installments_payments using index = pk Adding pos_cash_balance using index = pk Adding previous_application using index = sk_id_prev
def rel(t1, c1, t2, c2):
return ft.Relationship(es, t1, c1, t2, c2)
relationships = [
rel("application", "sk_id_curr", "bureau", "sk_id_curr"),
rel("bureau", "sk_id_bureau", "bureau_balance", "sk_id_bureau"),
rel("application", "sk_id_curr", "previous_application", "sk_id_curr"),
rel("previous_application", "sk_id_prev", "pos_cash_balance", "sk_id_prev"),
rel("previous_application", "sk_id_prev", "installments_payments", "sk_id_prev"),
rel("previous_application", "sk_id_prev", "credit_card_balance", "sk_id_prev"),
rel("application", "sk_id_curr", "pos_cash_balance", "sk_id_curr"),
rel("application", "sk_id_curr", "installments_payments", "sk_id_curr"),
rel("application", "sk_id_curr", "credit_card_balance", "sk_id_curr"),
]
es.add_relationships(relationships)
Entityset: applications
DataFrames:
application [Rows: 30751, Columns: 122]
bureau [Rows: 146263, Columns: 17]
bureau_balance [Rows: 1474814, Columns: 4]
credit_card_balance [Rows: 322932, Columns: 24]
installments_payments [Rows: 1159358, Columns: 9]
pos_cash_balance [Rows: 852590, Columns: 9]
previous_application [Rows: 141265, Columns: 37]
Relationships:
bureau.sk_id_curr -> application.sk_id_curr
bureau_balance.sk_id_bureau -> bureau.sk_id_bureau
previous_application.sk_id_curr -> application.sk_id_curr
pos_cash_balance.sk_id_prev -> previous_application.sk_id_prev
installments_payments.sk_id_prev -> previous_application.sk_id_prev
credit_card_balance.sk_id_prev -> previous_application.sk_id_prev
pos_cash_balance.sk_id_curr -> application.sk_id_curr
installments_payments.sk_id_curr -> application.sk_id_curr
credit_card_balance.sk_id_curr -> application.sk_id_curr
Let's persist the auto-detected info (logical/semantic) to disk and check it manually.
if run_entire_notebook():
for name, df in es.dataframe_dict.items():
f_logi = f"feateng/{name}____logical_types.json"
f_semantic = f"feateng/{name}____semantic_tags.json"
ftdfs_utils.save_logical_types(es, name, f_logi)
ftdfs_utils.save_semantic_tags(es, name, f_semantic)
skipping optional operation
After a quick inspection of the files, we can see that there are a few bits of metadata that could be improved.
I'll document a couple of examples, out of the many found, so you get an idea of the types of mismatches:
- Some columns detected as Integer could actually be marked as Boolean, potentially making things a lot easier for DFS (e.g. no need to encode them).

All of the files have been manually reviewed and tweaked to encode these minor details, which we expect to pay off at computation time (improving performance, reducing computational costs or minimizing memory footprints). The tweaked types are in the files under /feateng.
Some of these auto-detected types could be improved even further (e.g. we could mark some of them as Timedelta instead of Integer, to indicate that these numbers are not quantities but durations in time).
However, we will not go to such lengths right now. Let's first see how things perform with basic tweaking, and come back if the performance needs to be improved even further.
This is all nice and good, but so far we are just thinking about EDA.
Before continuing, let's make sure that DFS will work with the use cases that we have in mind.
We want to be familiar enough with the trade-offs to discard this tool, should we find a major dealbreaker.
For this, we need to spend a bit of time reviewing docs, and also assessing these tradeoffs with our intentions/data.
Below we're documenting some of the more well known issues/trade-offs found while exploring the documentation as well as our thoughts on how they might impact the viability as a key tool for the rest of the project.
Deep Feature Synthesis on Dask!
We need to remember that this laptop is kind of old, and that the training dataset is 7x larger than the current one... and some of those JOINs will likely increase our memory footprint (by more than a factor of 7!), so this project looked like a good opportunity to learn about and practice with Dask.
While featuretools supports Dask for Deep Feature Synthesis, it does not support all of the operations/transformations.
The official documentation reads (as of Oct-2023)
Supported Primitives When creating a feature matrix from a Dask EntitySet, only certain primitives can be used. Primitives that rely on the order of the entire DataFrame or require an entire column for computation are currently not supported when using a Dask EntitySet. Multivariable and time-dependent aggregation primitives also are not currently supported.
(It actually makes sense!)
This greatly reduces the number of transformations available to us, but since this is a learning project, not a production project, we're happy to trade off some predictive performance in our model in exchange for learning this interesting technology, which seems to fit our problem really well!
These missing features don't seem like a dealbreaker (so far!)
Let's continue!
DFS on Dask
DFS seems promising, but since this is the first time we try it, we want to make sure we're familiar with the caveats and limitations we will encounter when we use it with Dask.
We've documented our findings and how some of them might impact us.
Documentation reads:
By default, Woodwork checks that pandas DataFrames have unique index values. Because performing this same check with Dask would require an expensive compute operation, this check is not performed when adding a Dask DataFrame to an EntitySet. When using Dask DataFrames, users must ensure that the supplied index values are unique.
This was actually already done during the first passes/scans over the data, to guarantee that PKs were unique, e.g.:
select SK_ID_BUREAU, count(*)
from BUREAU
group by SK_ID_BUREAU
having count(*) > 1;
0 rows returned
This is also the reason that we created specific custom PK columns for the 3 tables that did not have guaranteed UNIQUE constraints on the IDs (check the use of add_compound_key() above for details)
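The equivalent check is cheap to re-run in pandas before the frames ever reach Dask; a minimal sketch, using the tables and the pk_col_name convention defined above:

```python
# Sketch: assert PK uniqueness in pandas (Dask will not check this for us).
tables_to_check = eda_dfs_all_entities()
for name, df in tables_to_check.items():
    pk = df.attrs["pk_col_name"]
    assert df[pk].is_unique, f"{name}: duplicate values in PK column {pk!r}"
```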
When adding a DataFrame to an EntitySet, Woodwork will attempt to infer the logical types for any columns that do not have a logical type defined. This inference process can be quite expensive for Dask DataFrames
This will be handled manually, to ensure that all the types are set correctly and cheaply. Check the JSON files in the repo for a dump of all the logical-type and semantic-tag configurations. Remember: these are the files we tweaked manually earlier!
When creating a Featuretools EntitySet that will be made of Dask DataFrames, all of the DataFrames used to create the EntitySet must be of the same type, either all Dask DataFrames or all pandas DataFrames. Featuretools does not support creating an EntitySet containing a mix of Dask and pandas DataFrames.
We don't plan on mixing dask and pandas.
Interesting values
Additionally, EntitySet.add_interesting_values() cannot be used in Dask EntitySets to find interesting values; however, it can be used to set a column's interesting values with the values parameter
We plan on using the library to infer the Interesting Values on Pandas (during EDA), and then exporting those insights for the full run on Dask! (Smart? Insane? Let's find out!) 😉 Wish us luck!
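A minimal sketch of what setting them explicitly looks like, per the values parameter quoted above (the column and values here are illustrative, taken from the pandas-side EDA):

```python
# Sketch: on a Dask EntitySet we cannot infer interesting values, but we can
# still set them explicitly through the `values` parameter.
es.add_interesting_values(
    dataframe_name="previous_application",
    values={"name_contract_status": ["Approved", "Refused"]},
)
```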
There are a few key limitations when generating a feature matrix from a Dask EntitySet. If a cutoff_time parameter is passed to featuretools.dfs() it should be a single cutoff time value, or a pandas DataFrame.
Additionally, Featuretools does not currently support the use of the approximate or training_window parameters when working with Dask EntitySets
We do not plan on using this for time analysis (plus, this dataset is already really clean! All the time-related preprocessing work has already been done for us!)
Finally, if the output feature matrix contains a boolean column with NaN values included, the column type may have a different datatype than the same feature matrix generated from a pandas EntitySet
Not our situation
In some instances, generating a feature matrix with a large number of features has resulted in memory issues on Dask workers. The underlying reason for this is that the partition size of the feature matrix grows too large for Dask to handle as the number of feature columns grows large. This issue is most prevalent when the feature matrix contains a large number of columns compared to the DataFrames in the EntitySet. Possible solutions to this problem include reducing the partition size used when creating the DataFrames or increasing the memory available on Dask workers.
Oh! This is going to be a BIG problem! Let's watch out for this one, very closely!
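One of the mitigations the docs suggest (smaller partitions when creating the DataFrames) is a one-liner; a sketch, with the partition count as an illustrative value we would tune:

```python
# Sketch: build the Dask frame with more, smaller partitions so the feature
# matrix partitions stay manageable as the column count grows.
import dask.dataframe as dd

h_app_dd = dd.from_pandas(h_app, npartitions=32)
```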
Currently featuretools.encode_features() does not work with a Dask DataFrame as input. This will hopefully be resolved in a future release of Featuretools.
We had no idea this was a possibility. So we lost nothing... but we're still sad.
The use of featuretools.remove_low_information_features() cannot currently be used with a Dask feature matrix.
Not sure we will need it but we will try.
If required, we might try the pandas-sample-first-dask-for-prod-later approach.
When manually defining a Feature, the use_previous parameter cannot be used if this feature will be applied to calculate a feature matrix from a Dask EntitySet.
This seems to refer to creating custom Features. If we were to create some, we wouldn't be allowed to specify timeframes (this is just another consequence of the limitations listed above: on Dask, total ordering and availability of all data is not guaranteed, which also impacts how Features can be created).
This will not affect us.
That's it! These are all of the trade-offs we found for DFS on Dask. While things are not excellent, they are pretty good! The parts we lose are very minor. The only big problem is the memory footprint, but that paragraph is really a reminder that applies whenever we work with large amounts of data; it is not something we would avoid by using pandas instead of Dask.
So, all minor issues, and the major one would still be there even if we skipped Dask. In fact, we are considering Dask precisely because of that issue, so nothing is lost.
With this in mind, I think we can continue with this plan for now.
Reference - Dask Tutorial 2h
Let's do a first pass with DFS on real data and see what the output looks like.
The ultimate goal is to calculate feature importance for the features (old and newly generated) and see if they actually contribute when predicting.
# using 1 got us 520 feats
# using 2 we got +1500 feats
# using 3 we got +5800... let's dial it down
dfs_depth = 1
prod_settings.put("dfs_depth", dfs_depth)
feats_imbalanced, feat_definitions = ft.dfs(
entityset=es,
target_dataframe_name="application",
agg_primitives=["mean", "sum", "min", "max", "count", "percent_true"],
trans_primitives=None,
max_depth=dfs_depth,
features_only=False,
)
feats_balanced = eda_utils.balance(feats_imbalanced, "target")
if run_entire_notebook("sweetviz_report_after_dfs"):
sv_report = sv.analyze(feats_balanced, target_feat="target", pairwise_analysis="off")
sv_report.show_html("cached/sweetviz_report_after_dfs.html")
display(IFrame("cached/sweetviz_report_after_dfs.html", width=1600, height=600))
skipping optional operation
Some interesting insights from these features (some of the lessons learned come from previous runs with depth=2 or depth=3):
- previous_application.credit_card_balance.* have a high number of NAs (> 70%)

It's pretty clear that we will be unable to manually assess so many columns. We want to semi-automatically discard the columns that do not contribute to predicting the target column; see the sketch of such a ranking right after this paragraph.
Note that we are using the balanced subset, to make sure we get a fairer assessment.
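For reference, the core of such an LGBM-based ranking is only a few lines; a minimal sketch of the idea (not necessarily identical to the internals of our featsel helper):

```python
# Sketch: rank features by LightGBM split importance.
import pandas as pd
from lightgbm import LGBMClassifier

def lgbm_rank(df, target, features, **hyperparams):
    model = LGBMClassifier(objective="binary", **hyperparams)
    model.fit(df[features], df[target])
    return (
        pd.DataFrame({"feature": features, "importance": model.feature_importances_})
        .sort_values("importance", ascending=False)
        .reset_index(drop=True)
    )
```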
num_feats = feats_balanced.select_dtypes(exclude="category")
cat_feats = feats_balanced.select_dtypes(include="category")
hyperparams = {"subsample": 0.5, "colsample_bytree": 0.5}
lgbm_importance_df = featsel.feature_selection_lgbm(
feats_balanced,
"target",
list(set(num_feats.columns) - {"target"}),
list(set(cat_feats.columns)),
objective="binary",
hyperparams=hyperparams,
)
lgbm_importance_df.head(10)
| | feature | importance |
|---|---|---|
| 0 | ext_source_2 | 78 |
| 1 | ext_source_3 | 75 |
| 2 | ext_source_1 | 65 |
| 3 | days_birth | 62 |
| 4 | organization_type | 49 |
| 5 | MEAN(bureau.days_credit_enddate) | 40 |
| 6 | MAX(bureau.days_credit_enddate) | 35 |
| 7 | days_employed | 34 |
| 8 | days_registration | 33 |
| 9 | MEAN(previous_application.cnt_payment) | 31 |
Let's try to visualize how the feature importance spreads across all of our features:
(We don't want to analyze them individually, we just want to see the overall shape/spread of the importances)... and as it turns out, the default importance plot cannot be customized for this either, so we build our own barplot.
Note that we use LGBM here just because it's a bit faster than XGBoost, while both achieve similar performance (we compared them).
prod_settings.put("dfs_most_important_feats", list(lgbm_importance_df.head(100)["feature"]))
@run
@cached_chart()
def eda_feat_importance():
plt.figure(figsize=(9, 20))
return sns.barplot(
data=lgbm_importance_df.head(100),
y="feature",
x="importance",
color=moonstone,
)
<AxesSubplot: xlabel='importance', ylabel='feature'>
This looks really good. Here are a few things we like seeing:
- The strongest features (EXT_SOURCE_*, DAYS_BIRTH, etc.) match our earlier findings.

Here are some things we'd like to see explained:
- Where are the FLAG_DOCUMENT_xx fields? They seemed important in our initial inspection! Were we wrong?

lgbm_importance_df.loc[lgbm_importance_df["feature"].str.contains("docu")].head(10)
| | feature | importance |
|---|---|---|
| 275 | flag_document_3 | 2 |
| 327 | flag_document_6 | 1 |
| 345 | flag_document_18 | 1 |
| 367 | flag_document_17 | 0 |
| 378 | flag_document_4 | 0 |
| 380 | flag_document_14 | 0 |
| 381 | flag_document_7 | 0 |
| 385 | flag_document_13 | 0 |
| 391 | flag_document_2 | 0 |
| 395 | flag_document_5 | 0 |
After checking visually a second time and re-reading our original text, it seems that I was misremembering.
The original text read:
The documents (
FLAG_DOCUMENT_xx). When documents are provided, the risk decreases noticeably.
Which was true. In my head, I had accidentally remembered it as "the FLAG_DOCUMENT_xx fields help predict the target", but due to the high imbalance of the data and how few cases there are of people providing the documents, they don't help the overall case.
Basically, if you can provide the documents, you are likely to be better off... but this is so rare that using these fields for prediction will not be useful, and we have to rely on other fields that contribute more data!
In summary: this feature importance analysis has been very useful to avoid fooling ourselves! It is helping us spot which fields to keep for the next steps; and in the rare cases where things didn't make sense, we were able to go back and find that LGBM was much better at this than we were (manually)!
Let's do one last check just to be EXTRA sure. Let's compare the distribution of NAs for these generated features:
Let's also compare how the NAs are spread in this subsampled-balanced dataset of the most useful 100 features.
top_100_feats_by_importance = feats_balanced[
["target"] + list(lgbm_importance_df["feature"].head(100).values)
]
@run
@cached_chart()
def dfs_nas_after_feat_selection_split():
return msno.matrix(
top_100_feats_by_importance, sort="descending", color=moonstone_rgb_n
)
Loading from cache [./cached/charts/dfs_nas_after_feat_selection_split.png]
Not bad. Let's also compare them across the different values of TARGET, in case there were obvious patterns:
@run
@cached_chart()
def dfs_nas_after_feat_selection_split_by_target():
return plot_missingnos_comparison(
feats_balanced[["target"] + list(lgbm_importance_df["feature"].head(100).values)],
"target",
color=moonstone_rgb_n,
sort_first=True,
)
Loading from cache [./cached/charts/dfs_nas_after_feat_selection_split_by_target.png]
Nothing unexpected.
Let's also identify how the NAs stack up in "original columns" vs "newly generated columns". We will mark the DFS ones:
@run
@cached_chart()
def eda_plot_nas_after_dfs():
derived_col = top_100_feats_by_importance.columns[
top_100_feats_by_importance.columns.str.contains(r"\(")  # DFS-generated names contain "("
]
marks = {"grey": derived_col}
ax = eda_utils.plot_nas_bootstrapped(
top_100_feats_by_importance, marked_columns=marks
)
return ax
Loading from cache [./cached/charts/eda_plot_nas_after_dfs.png]
Let's store the configuration of "top features to generate" so that, when we train, we only generate those ones, in order to save some valuable computing time and memory:
def save_dfs_features(filename: str):
pick_best_cols = 100
top_columns = list(lgbm_importance_df.head(pick_best_cols)["feature"])
generated_feats_names_all = [
feature.get_name()
for feature in feat_definitions
if ("IdentityFeature" not in str(type(feature)))
]
generated_feats_names_good = [
feat for feat in generated_feats_names_all if feat in top_columns
]
selected_features = [
feature
for feature in feat_definitions
if feature.get_name() in generated_feats_names_good
]
ft.save_features(selected_features, location=filename)
if not os.path.exists(ProdSettings.PATH_DFS_FEATURE_DEFINITIONS):
save_dfs_features(ProdSettings.PATH_DFS_FEATURE_DEFINITIONS)
print("feature definitions saved")
else:
print("file", dfs_feat_defs_filename, "already exists")
file prod/dfs_feature_definitions.json already exists
print(feats_balanced[feats_balanced["target"] == 0].shape)
print(feats_balanced[feats_balanced["target"] == 1].shape)
(2482, 521) (2482, 521)
na_comparison = pd.DataFrame(
{
"target_0": feats_balanced[feats_balanced["target"] == 0].isna().sum()[0:100],
"target_1": feats_balanced[feats_balanced["target"] == 1].isna().sum()[0:100],
}
)
na_comparison = na_comparison[
(na_comparison["target_0"] > 0) & (na_comparison["target_1"] > 0)
]
na_comparison["ratio"] = na_comparison["target_0"] / na_comparison["target_1"]
display(na_comparison.sort_values(by="ratio", ascending=False).head(10))
display(na_comparison.sort_values(by="ratio", ascending=False).tail(10))
| | target_0 | target_1 | ratio |
|---|---|---|---|
| def_60_cnt_social_circle | 12 | 4 | 3.000000 |
| obs_30_cnt_social_circle | 12 | 4 | 3.000000 |
| obs_60_cnt_social_circle | 12 | 4 | 3.000000 |
| def_30_cnt_social_circle | 12 | 4 | 3.000000 |
| occupation_type | 814 | 614 | 1.325733 |
| amt_goods_price | 5 | 4 | 1.250000 |
| ext_source_2 | 3 | 3 | 1.000000 |
| own_car_age | 1660 | 1719 | 0.965678 |
| ext_source_1 | 1370 | 1460 | 0.938356 |
| nonlivingapartments_mode | 1680 | 1852 | 0.907127 |
| | target_0 | target_1 | ratio |
|---|---|---|---|
| floorsmax_medi | 1161 | 1430 | 0.811888 |
| floorsmax_mode | 1161 | 1430 | 0.811888 |
| floorsmax_avg | 1161 | 1430 | 0.811888 |
| years_beginexpluatation_medi | 1135 | 1404 | 0.808405 |
| years_beginexpluatation_mode | 1135 | 1404 | 0.808405 |
| years_beginexpluatation_avg | 1135 | 1404 | 0.808405 |
| emergencystate_mode | 1097 | 1373 | 0.798980 |
| totalarea_mode | 1116 | 1400 | 0.797143 |
| ext_source_3 | 452 | 572 | 0.790210 |
| name_type_suite | 9 | 13 | 0.692308 |
@run
@cached_chart()
def eda_dfs_ratio_nas_after_dfs():
plt.xlim(0, 3)
plt.title("Ratio of NAs between target=0 and target=1 (close to 1 is best)")
return sns.histplot(na_comparison["ratio"], color=moonstone)
Loading from cache [./cached/charts/eda_dfs_ratio_nas_after_dfs.png]
It seems that for the vast majority of features on this artificially balanced dataset, the % of NAs is the same regardless of the value of the TARGET label.
In the rare situations where the ratio is far from 1.0, it's mostly due to the low number of cases.
Based on the data we have, we don't expect our model to gain additional insights from new columns flagging "imputed but originally null" fields, so we have decided to skip this step in our data preprocessing pipeline.
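For reference, the skipped step would have been a small addition to the imputation stage; a sketch using scikit-learn's add_indicator flag:

```python
# Sketch: what the skipped step would look like. add_indicator=True appends
# one boolean "was originally missing" column per feature that had NAs.
from sklearn.impute import SimpleImputer

imputer_with_flags = SimpleImputer(strategy="median", add_indicator=True)
```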
The last thing we want to do before closing off the EDA part is to answer this question about the manual feature engineering we did:
def enrich_knn(impute_neighbors: int, target_mean_neighbors: int) -> Pipeline:
p_enrich = Pipeline(
[
(
"drop_nas_req_annuity",
clean.DropRowsWithNas(in_cols=["amt_credit", "amt_annuity"]),
),
("add_credit_annuity_ratio", credit_clean.AddCreditAnnuity),
(
"impute_external_scores_knn",
credit_clean.KnnImputeCols(
["ext_source_1", "ext_source_2", "ext_source_3"],
n_neighbors=impute_neighbors,
),
),
(
"add_target_mean_extsources",
credit_clean.TargetMeanEncodeUsing(
cols=[
"ext_source_1",
"ext_source_2",
"ext_source_3",
"credit_annuity_ratio",
],
n_neighbors=target_mean_neighbors,
),
),
]
)
return p_enrich
Let's try to perform target encoding using these 4 continuous features and their $k$ nearest neighbors.
For now, we are using a few hundred neighbors. These are arbitrary values; we just want to see whether this creates a feature that enriches our dataset in a meaningful way. Yes, if this works and we wanted to boost our performance, we could consider tuning these parameters with Optuna to find their sweet spots.
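If we ever decide to tune them, a sketch of what that could look like with Optuna (score_pipeline is a hypothetical scoring helper, e.g. cross-validated AUC):

```python
# Sketch: searching for the neighbor-count sweet spots with Optuna.
import optuna

def objective(trial: optuna.Trial) -> float:
    impute_k = trial.suggest_int("impute_neighbors", 20, 500)
    tme_k = trial.suggest_int("target_mean_neighbors", 50, 1000)
    return score_pipeline(enrich_knn(impute_k, tme_k))  # hypothetical helper

# study = optuna.create_study(direction="maximize")
# study.optimize(objective, n_trials=50)
```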
f = feats_balanced.copy()
f["target"] = f["target"].astype("int")
es_enriched = enrich_knn(100, 200).fit_transform(f.drop(columns="target"), f["target"])
Let's visualize how well (or not) our model did, to get an idea of its performance (smaller errors are better):
diff = es_enriched.join(f["target"])
diff["err"] = np.abs(diff["target"] - diff["mean_target"])
@run
@cached_chart()
def feat_eng_enriched_meantarget_after_extsources():
return chart_utils.visualize_3d_cube(
diff,
dimX="ext_source_1",
dimY="ext_source_2",
dimZ="ext_source_3",
target="mean_target",
title="Spread of mean_target",
elevation=30,
rotation=30,
palette="PuOr",
)
Loading from cache [./cached/charts/feat_eng_enriched_meantarget_after_extsources.png]
The results are quite similar to our prior findings, except that the range is narrower.
Let's also visualize the error for each of those points (what the real value is, vs what we predicted based on their neighbors):
@run
@cached_chart()
def feat_eng_enriched_err_after_extsources():
return chart_utils.visualize_3d_cube(
diff,
dimX="ext_source_1",
dimY="ext_source_2",
dimZ="ext_source_3",
target="err",
title="Spread of ERR",
elevation=30,
rotation=30,
palette="PuOr",
)
Loading from cache [./cached/charts/feat_eng_enriched_err_after_extsources.png]
It seems that the errors are below 0.5 (which is promising). Let's drill down a bit further:
plt.title("error (target - mean target of neighbors)")
sns.histplot(diff, x="err", color=moonstone, binwidth=0.03, multiple="stack")
plt.axvline(x=0.5)
<matplotlib.lines.Line2D at 0x7fe1532a2970>
It does seem that most of the values fall below 0.5, indicating that there is some information encoded in this field (better than random chance).
Let's check that. What? Are you saying we could use a t-test?
Alright, let's give it a try:
stats.ttest_ind(target_mean_0, target_mean_1)
Ttest_indResult(statistic=-27.915856712576, pvalue=2.0329687910962912e-159)
Since the pvalue is lower than our significance level, we must reject the null hypothesis.
But, there's a plot twist!
Unfortunately, we cannot use standard ttests because this distribution does not pass our requirements for "normality":
e = diff["err"]
stats.kstest((e - e.mean()) / e.std(), "norm")
KstestResult(statistic=0.06461382859316356, pvalue=1.8422604526814796e-18)
p_value < 0.05 - we must reject the $H_0$ - This distribution is not normal, based on the Kolmogorov-Smirnov test
Since Kolmogorov-Smirnov suggests that the distribution is not normal, we cannot trust the result of the first t-test (above).
Let's try to inspect it visually:
It also seems quite shifted if we compare the two halves around a cutoff point of 0.5:
sns.histplot(diff, x="err", color=moonstone, binrange=[0, 1], bins=2, multiple="stack")
plt.axvline(x=0.5)
<matplotlib.lines.Line2D at 0x7fe14a915100>
half = np.sign(e - 0.5)
half.value_counts()
-1.0 3162 1.0 1802 Name: err, dtype: int64
We could use the Wilcoxon test, which does not require normality, but the data is not symmetric... so we will have to discard this result too:
stats.wilcoxon(e - 0.5)
WilcoxonResult(statistic=3902853.5, pvalue=7.789745380488737e-111)
Again, even despite the minuscule p_value, we must ignore it, as we do not meet the required conditions for Wilcoxon: symmetry!
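A quick way to back that symmetry claim with a number (skewness of the centered errors; values far from 0 indicate asymmetry):

```python
# Sketch: quantify the asymmetry that disqualifies the Wilcoxon test.
from scipy.stats import skew

print(skew(e - 0.5))  # far from 0 => not symmetric around 0.5
```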
So far, none of the tests have been able to give us an answer that satisfies us.
Let's leave statistical inference aside and determine empirically whether we should include this metric: does it provide information that helps us predict our actual target?
We can use Mutual Information to achieve this:
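A minimal sketch of that check (the full comparison is the LGBM importance chart below; mutual_info_classif is one way to quantify how much information the new column carries about the label):

```python
# Sketch: how much information does mean_target carry about the real target?
from sklearn.feature_selection import mutual_info_classif

cols = ["mean_target", "credit_annuity_ratio", "ext_source_1", "ext_source_2", "ext_source_3"]
mi = mutual_info_classif(diff[cols].fillna(0), diff["target"].astype(int), random_state=0)
print(dict(zip(cols, mi.round(3))))
```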
chart_utils.visualize_3d_cube(
diff,
dimX="ext_source_1",
dimY="ext_source_2",
dimZ="ext_source_3",
target="mean_target",
title="Spread of TARGET over pairs of EXT_SOURCE_*",
elevation=30,
rotation=30,
palette="PuOr",
)
@run
@cached_chart()
def eda_feat_importance_including_enriched_extsource():
df = diff.drop(columns="err")
num_feats = df.select_dtypes(exclude="category")
cat_feats = df.select_dtypes(include="category")
hyperparams = {"subsample": 0.5, "colsample_bytree": 0.5}
imp = featsel.feature_selection_lgbm(
df,
"target",
list(set(num_feats.columns) - {"target"}),
list(set(cat_feats.columns)),
objective="binary",
hyperparams=hyperparams,
)
markers = ["mean_target", "credit_annuity_ratio"]
p = sns.barplot(
data=imp.head(25),
y="feature",
x="importance",
color=moonstone,
)
for marker in markers:
row_mean_target = imp[imp["feature"] == marker].index[0]
x = imp[imp["feature"] == marker]["importance"]
plt.plot(x + 5, row_mean_target, marker="<", color="orange", markersize=8)
return p
Loading from cache [./cached/charts/eda_feat_importance_including_enriched_extsource.png]
It seems that the 2 new features we incorporated are actually contributing strongly to our model.
Let's keep them.
In this file:
In the next notebook, we will build a few artifacts that will help us automate this data preprocessing (data cleaning, DFS, outlier detection and removal, etc..).
Whenever you're ready, follow me into the next notebook!